Tagged articles

Operations

3329 articles · Page 16 of 34

Apr 29, 2022 · Operations

How 58 Home Service Standardized Cleaning: From User Research to SOP Success

This article examines how 58 Home Service identified service gaps through user research, built a detailed user‑experience map, created a comprehensive SOP handbook covering image, etiquette, and behavior, and implemented training, assessment, and incentives to dramatically improve customer satisfaction and reduce complaints.

Operations · Training · quality assurance

0 likes · 9 min read

DaTaobao Tech

Apr 29, 2022 · Industry Insights

How Taobao Mini Programs Cut Load Times by 30%: A Data‑Driven Performance Playbook

This article analyzes the performance challenges of Taobao Mini Programs, defines a multi‑dimensional experience metric, builds a standardized ops data pipeline, introduces the T2 first‑screen algorithm and a three‑stage performance model, and shares concrete optimization practices that reduced T2 from 2.7 s to 1.9 s while improving business metrics.

Mini Program · Operations · Optimization

0 likes · 10 min read

DevOps Cloud Academy

Apr 26, 2022 · Operations

Understanding DevOps: Principles, History, Benefits, and Implementation Strategies

This article explains the core principles of DevOps, its historical development, the advantages it brings to software delivery, and practical steps for organizations to adopt a collaborative DevOps culture between development and operations teams.

Collaboration · Continuous Deployment · Operations

0 likes · 9 min read

Bilibili Tech

Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Oncall · Operations

0 likes · 23 min read

Efficient Ops

Apr 26, 2022 · Operations

How Beijing Gas Achieved Advanced DevOps Maturity: A Detailed Case Study

Beijing Gas’s Tongzhou Call Center project passed the Level 2 DevOps continuous‑delivery assessment, showcasing how standardized processes, a cloud‑native tool platform, and agile practices dramatically improved delivery speed, quality, and digital transformation across the organization.

Case Study · Maturity Model · Operations

0 likes · 11 min read

Efficient Ops

Apr 26, 2022 · Operations

How China’s Top Firms Achieved Leading DevOps Maturity – Assessment Insights

The CAICT’s fourth‑batch DevOps assessment reveals that China Zhongjin Wealth’s platform passed the Excellent level, showcasing how standardized pipelines, tool empowerment, and the DevOps Capability Maturity Model dramatically boost delivery speed, quality, and competitiveness across major enterprises.

Capability Maturity Model · Operations · Standardization

0 likes · 6 min read

dbaplus Community

Apr 25, 2022 · Operations

From Monitoring to Observability: Expert Insights on Evolving Cloud‑Native Operations

In this interview series, three industry experts explain how monitoring differs from observability, the shifts required for ops, developers, and architects, the core methodologies and technologies behind metrics, traces, and logs, and practical guidance for selecting and integrating observability tools in cloud‑native environments.

Observability · Operations · cloud-native

0 likes · 16 min read

DevOps Cloud Academy

Apr 24, 2022 · Operations

Improving Communication Between Development and Operations Teams: A DevOps Cultural Guide

This article explains how adopting a DevOps culture—by forming true functional teams, shortening release cycles, and aligning shared goals—can resolve the conflicting objectives of developers and operations, enhance communication, and enable more frequent, higher‑quality software deliveries.

Functional Teams · Operations · communication

0 likes · 7 min read

Alibaba Cloud Infrastructure

Apr 24, 2022 · Operations

How Alibaba Cloud Guarantees Millisecond DNS Reliability with Automated Ops

The article examines Alibaba Cloud's DNS operation platform, detailing its three‑stage evolution—standardization, automation, and intelligent automation—and how these practices achieve sub‑10 ms latency stability, zero‑downtime fault isolation, and scalable reliability for billions of daily queries.

Automation · Cloud · DNS

0 likes · 9 min read

Cognitive Technology Team

Apr 24, 2022 · Backend Development

Thread Pool Misconfiguration Cases and Best Practices for Resilience

The article presents two 2018 incidents where improper Java thread‑pool settings caused service degradation and unavailability, analyzes the root causes such as insufficient core size, unbounded queues, and missing rejection handlers, and offers practical recommendations for dynamic sizing, alerting, degradation strategies, isolation, and auto‑scaling to prevent similar faults.

FaultTolerance · JavaConcurrency · Operations

0 likes · 3 min read

DevOps Cloud Academy

Apr 23, 2022 · Operations

A Comprehensive Overview of DevOps Tools and Their Roles

This article introduces the DevOps culture and systematically categorizes a wide range of DevOps tools—including source‑code management, CI/CD, containers, cloud providers, automation, monitoring, project management, and secret management—to help teams improve productivity and collaboration.

Automation · CI/CD · Containers

0 likes · 9 min read

TAL Education Technology

Apr 21, 2022 · Databases

Applying Orchestrator for High‑Availability MySQL in TAL Education Group’s Database System

This article describes how TAL Education Group evaluated, selected, and customized the open‑source Orchestrator tool to build a highly available, secure, and extensible MySQL HA solution that meets 99.99% uptime, data‑integrity, cross‑datacenter, and operational automation requirements.

Database Architecture · High Availability · Operations

0 likes · 9 min read

Liangxu Linux

Apr 20, 2022 · Operations

How to Quickly Identify Disk Space Hogs on Linux Servers

When a Linux server raises a disk‑space alarm, this guide shows step‑by‑step how to locate the offending directories or files using df, du, find, lsof and tune2fs, and explains why reported usage may differ from summed directory sizes.

Operations · du · find

0 likes · 4 min read

IT Architects Alliance

Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI

0 likes · 21 min read

Architect

Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO

0 likes · 22 min read

Alibaba Cloud Native

Apr 16, 2022 · Cloud Native

How AHAS Feature Switches Simplify Dynamic Configuration in Cloud‑Native Microservices

This article explains common configuration challenges in microservice applications and introduces Alibaba Cloud's AHAS feature switch as a lightweight, dynamic configuration framework that offers zero‑code integration, strong type validation, persistent storage, and non‑intrusive deployment for real‑time business control.

AHAS · Feature Switch · Operations

0 likes · 8 min read

YunZhu Net Technology Team

Apr 15, 2022 · Operations

Design and Architecture of a Cloud‑Native Monitoring Platform for Business Systems

The document outlines the background, vision, current status, technical research, value, product and technical architecture, and functional design of a cloud‑native monitoring platform that integrates SkyWalking and Prometheus to provide comprehensive APM, resource utilization, alerting, and rapid fault localization for business and technical middle‑platform services.

APM · Observability · Operations

0 likes · 10 min read

IT Architects Alliance

Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden monitoring metrics, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · Operations · SLI

0 likes · 15 min read

MaGe Linux Operations

Apr 9, 2022 · Operations

Master Server Log Analysis: 20 Essential Linux Commands to Uncover Traffic, Errors, and Performance

This guide compiles a comprehensive set of Linux commands—using awk, grep, netstat, and other tools—to help you count unique IPs, rank page visits, filter bots, monitor connection states, calculate bandwidth, and identify high‑traffic or error‑prone resources from Apache or Nginx logs.

Linux · Operations · awk

0 likes · 12 min read

Dada Group Technology

Apr 8, 2022 · Operations

Marketing Guard: A Risk Pre‑Warning System for E‑Commerce Marketing Operations

The article presents a comprehensive analysis of marketing‑related financial loss cases, outlines the design and implementation of a non‑intrusive, event‑driven Marketing Guard system with dual‑layer ES‑HBase storage, and discusses its operational safeguards, achievements, shortcomings, and future development plans.

Operations · marketing risk · risk prewarning

0 likes · 12 min read

Architecture Digest

Apr 6, 2022 · Operations

Why Organizations Struggle with DevOps: Leadership, Structure, Value‑Stream Mapping and Key Practices

The article explains that many organizations fail to achieve the promised business value of DevOps because they overlook four critical factors—leadership, organizational structure, value‑stream mapping, and regular pulse checks—and provides concrete recommendations to address each area.

Operations · Value Stream Mapping · organizational structure

0 likes · 9 min read

Top Architect

Apr 4, 2022 · Operations

Monitoring Nginx with nginx‑status, Telegraf, InfluxDB, and Grafana

This guide explains how to enable the nginx‑status module, configure Nginx to expose metrics, collect them with Telegraf, store them in InfluxDB, and visualize the data in Grafana, providing a complete end‑to‑end monitoring solution for Nginx servers.

Grafana · InfluxDB · Nginx

0 likes · 5 min read

Open Source Linux

Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Automation · Incident Management · Operations

0 likes · 13 min read

Architecture Digest

Apr 1, 2022 · Backend Development

Why Starting a New Project with Microservices Is Usually a Bad Idea – Monoliths Are Your Friend

The article argues that launching a brand‑new project using microservices often incurs excessive infrastructure, cultural, and operational costs that outweigh the touted benefits, and suggests that a well‑designed modular monolith can be a more pragmatic alternative for many teams.

Deployment · Operations · fault isolation

0 likes · 13 min read

dbaplus Community

Mar 31, 2022 · Databases

Why Build Your Own Database Middleware in a Multi‑Cloud Era?

The article explains why, despite cloud services, enterprises still need to develop their own database middleware to ensure multi‑cloud compatibility, vendor neutrality, high availability, and scalable performance, detailing the challenges, design principles, core features, technical metrics, and operational benefits of such a solution.

Database Middleware · Multi-Cloud · Operations

0 likes · 20 min read

TAL Education Technology

Mar 31, 2022 · Cloud Computing

Hybrid Cloud Governance at TAL Education: Challenges, Methods, and Future Plans

This article examines TAL Education's hybrid‑cloud journey, explaining what hybrid cloud is, presenting industry adoption statistics, detailing the company's initial network chaos, outlining governance difficulties, describing the first‑phase remediation measures, and outlining the objectives and methods for the second‑phase transformation.

Hybrid Cloud · Network Governance · Operations

0 likes · 20 min read

IT Architects Alliance

Mar 30, 2022 · Operations

30 Essential Architecture Patterns for Scalable and Resilient Systems

This article systematically presents thirty architectural patterns—covering management, monitoring, performance, scalability, data handling, design, messaging, resilience, and security—to help engineers design, implement, and operate robust, high‑performance distributed systems.

Design Patterns · Operations · Performance

0 likes · 33 min read

MaGe Linux Operations

Mar 28, 2022 · Databases

Why GitHub’s MySQL Cluster Crashed: Lessons from Recent Outages

GitHub experienced multiple service outages over recent weeks due to resource contention in its MySQL1 cluster, leading to prolonged downtimes, and the company disclosed detailed timelines, root causes, and ongoing mitigation measures such as load audits, traffic shifting, and infrastructure scaling to prevent future incidents.

GitHub · MySQL · Operations

0 likes · 3 min read

DevOps Cloud Academy

Mar 28, 2022 · Operations

Understanding DevOps: Definition, Benefits, Practices, and Drawbacks

This article explains DevOps as a cultural, organizational, and technical shift that unifies development, operations, and quality assurance, outlines its benefits such as faster delivery and improved reliability, describes key practices like CI/CD, multi‑environment deployments, early failure detection, rollback, policy enforcement and observability, and discusses its potential drawbacks and considerations.

Automation · CI/CD · Operations

0 likes · 12 min read

Efficient Ops

Mar 28, 2022 · Cloud Native

How to Scale Kubernetes Clusters: Quotas, Kernel Tweaks, and Etcd Best Practices

This guide explains how to adjust node quotas, tune kernel parameters, configure high‑availability etcd clusters, and set optimal Kube‑APIServer and Pod settings for large‑scale Kubernetes deployments, ensuring stability and performance as the cluster grows.

Kubernetes · Operations · cloud-native

0 likes · 8 min read

Architecture Digest

Mar 26, 2022 · Operations

Top Free Docker GUI Tools for Efficient Container Management

This article reviews several free Docker graphical user interface (GUI) tools—including Portainer, DockStation, Docker Desktop, Lazydocker, and Docui—detailing their platform support, feature sets, Docker version compatibility, and practical usage scenarios for streamlined container administration.

Container Management · Docker · GUI

0 likes · 7 min read

Efficient Ops

Mar 20, 2022 · Operations

How to Keep Your Kubernetes Nodes and Pods Stable: Essential Ops Practices

This guide walks through essential Kubernetes operations—from node kernel upgrades and Docker daemon tuning to pod resource limits, scheduling policies, health probes, logging standards, and comprehensive monitoring—providing practical commands and configurations to keep clusters stable and observable.

Kubernetes · Node Management · Operations

0 likes · 18 min read

Open Source Linux

Mar 18, 2022 · Operations

Evolution of Open‑Source Monitoring Tools: From Nagios to Prometheus

This article traces the development of open‑source monitoring solutions from early tools like Nagios and Cacti through modern platforms such as Prometheus and Nightingale, comparing their strengths, weaknesses, and typical use cases while also looking ahead to emerging observability trends in cloud‑native environments.

Observability · Operations · Prometheus

0 likes · 14 min read

Efficient Ops

Mar 17, 2022 · Operations

Inside China’s AIOps Standard: Key Insights from the 4th Draft Meeting

The article reports on the fourth draft discussion of China’s Cloud Computing Intelligent Operations (AIOps) Capability Maturity Model – Part 2, detailing the meeting’s participants, the finalized system and tool technical requirements, and the progress toward a comprehensive AIOps standard that addresses quality, cost, efficiency, and security across multiple functional modules.

AIOps · Artificial Intelligence · Cloud Computing

0 likes · 5 min read

Selected Java Interview Questions

Mar 17, 2022 · Operations

Monitoring Nginx with Telegraf, InfluxDB, and Grafana

After setting up an Nginx cluster, this guide explains how to enable the nginx-status module, collect metrics with Telegraf, store them in InfluxDB, and visualize the data using Grafana, providing a complete solution for real-time Nginx monitoring.

Grafana · InfluxDB · Operations

0 likes · 5 min read

FunTester

Mar 17, 2022 · Operations

Turning Manual Performance Monitoring into Automated Multi‑Level Alerts

The author explains how they distinguished test automation from automated testing, identified monitoring pain points, built a custom scraper‑driven alert system with three escalation levels, tackled common pitfalls, and achieved faster, more reliable performance testing alerts.

Operations · alert system · monitoring scripts

0 likes · 6 min read

Efficient Ops

Mar 15, 2022 · Cloud Native

How eBPF Powers Seamless Observability in Cloud‑Native Kubernetes Environments

This article explains why the rise of Kubernetes as a cloud‑native standard brings new observability challenges, outlines how eBPF enables non‑intrusive, multi‑language, multi‑protocol data collection, and describes a comprehensive monitoring stack—including golden metrics, service topology, tracing, alerts, and network diagnostics—to achieve end‑to‑end visibility in complex Kubernetes deployments.

Kubernetes · Operations · cloud-native

0 likes · 22 min read

IT Architects Alliance

Mar 13, 2022 · Operations

30 Essential Architecture Patterns for Scalable, Resilient Systems

This article presents a comprehensive catalog of thirty architectural patterns, grouped into management, monitoring, performance, data management, design, messaging, resilience, and security categories, and explains their purpose, typical use cases, benefits, and implementation considerations to help engineers build robust, high‑performance distributed applications.

Operations · Resilience · System Design

0 likes · 32 min read

Architects' Tech Alliance

Mar 12, 2022 · Cloud Computing

Understanding and Managing Complexity in Multi‑Cloud Infrastructure

The article examines the growing complexity of multi‑cloud and hybrid cloud environments, identifies security, API, and logging challenges, and proposes a flexible, cloud‑neutral automation platform with clear communication, audit, planning, and incremental implementation steps to reduce operational overhead and cost.

Automation · Operations · cloud-native

0 likes · 12 min read

AntTech

Mar 12, 2022 · Operations

Evolution of Large‑Scale Distributed System Stability at Ant Group

The article outlines Ant Group's multi‑stage journey of building large‑scale distributed system stability, describing architectural evolutions, risk‑inspection mechanisms, high‑availability solutions such as LDC and fine‑grained traffic scheduling, and intelligent risk‑defense products that together enable resilient, cost‑effective operations.

High Availability · Operations · capacity scaling

0 likes · 15 min read

Dada Group Technology

Mar 11, 2022 · Operations

Design and Iteration of JD Daojia Order Timeliness System

This article details the background, overall architecture, iterative improvements, and future directions of JD Daojia's order timeliness system, covering early limitations, business‑driven challenges, solution iterations, order‑control mechanisms, product‑dimension handling, and the final business architecture to enhance fulfillment rates and user experience.

JD Daojia · Operations · backend

0 likes · 11 min read

Open Source Linux

Mar 11, 2022 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This article presents a curated list of practical Linux operation tools—including Nethogs, IOzone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—detailing their purpose, download links, installation commands, and basic usage to help system administrators improve monitoring, performance testing, and security on Linux servers.

Linux · Operations · Tools

0 likes · 12 min read

DevOps Cloud Academy

Mar 10, 2022 · Operations

Nine Essential DevOps Metrics and KPIs for Effective Software Delivery

This article outlines nine key DevOps metrics and KPIs—including the four DORA indicators and five additional measures—explaining how they help teams monitor performance, improve delivery speed, ensure quality, and maintain reliable, high‑availability applications.

DORA · KPI · Operations

0 likes · 9 min read

MaGe Linux Operations

Mar 9, 2022 · Operations

Why Do 502 Errors Appear Only on POST Requests After Migrating to PaaS?

After moving an application to a PaaS platform, intermittent 502 errors occur, seemingly only for POST requests, but the root cause lies in Nginx‑Ingress and uwsgi HTTP version mismatches, connection reuse, and retry behavior, which can be diagnosed through traffic analysis and configuration changes.

502 error · HTTP version mismatch · Ingress

0 likes · 6 min read

Open Source Linux

Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Incident Management · Kubernetes · Observability

0 likes · 8 min read

High Availability Architecture

Mar 7, 2022 · Operations

Understanding High Concurrency, High Availability, Performance, and Scalability: Concepts and Metrics

This article systematically explains the relationships among high concurrency, high availability, performance, and scalability, defines their quantitative metrics, categorizes sources of change that affect system reliability, and outlines strategies for fault prediction, impact reduction, and rapid recovery in large‑scale services.

Operations · Reliability · System Design

0 likes · 11 min read

Efficient Ops

Mar 6, 2022 · Operations

How Top Chinese Insurers Achieved DevOps Maturity: Real‑World Case Studies

This article examines how three leading Chinese insurance companies used the nationally‑backed DevOps Capability Maturity Model to evaluate and improve their IT operations, detailing project architectures, cloud‑native implementations, continuous‑delivery results, and the broader significance of the DevOps standard.

Case Study · Insurance · Maturity Model

0 likes · 8 min read

IT Architects Alliance

Mar 6, 2022 · Operations

Mastering Nginx: From Static Servers to Advanced Load Balancing and Reverse Proxy

This guide walks through deploying static files with Nginx, configuring location blocks and regex patterns, setting up reverse proxy to Java services, implementing various load‑balancing strategies (round‑robin, weight, ip_hash, fair, url_hash), separating static and dynamic content, and using essential directives such as return, rewrite, error_page, logging, deny, and built‑in variables.

Nginx · Operations · Reverse Proxy

0 likes · 16 min read

Efficient Ops

Mar 3, 2022 · Operations

How China’s Telecom Giants Accelerate IT Efficiency with DevOps Maturity Models

This article reviews how leading Chinese telecom operators adopted the CAICT‑led DevOps Capability Maturity Model, detailing 17 evaluated projects across companies, the improvements achieved in continuous delivery, technical operations, and tooling, and the broader impact on IT efficiency and digital transformation.

IT efficiency · Maturity Model · Operations

0 likes · 15 min read

Youzan Coder

Mar 3, 2022 · Operations

How Standard Deviation Uncovers Hidden Bottlenecks in Software R&D Throughput

The article introduces a new R&D efficiency metric—throughput standard deviation—explains its statistical basis, shows how it was derived from annual reports, illustrates its application across multiple teams, and discusses practical insights and limitations for software development operations.

Operations · R&D efficiency · Throughput

0 likes · 7 min read

Efficient Ops

Mar 2, 2022 · Operations

How Chinese Banks Accelerate IT Efficiency with DevOps Maturity Models

This article reviews how major Chinese joint‑stock banks have adopted the CAICT‑led DevOps Capability Maturity Model, detailing assessment numbers, case studies of each bank's DevOps implementation, and the model’s standards and industry impact.

Case Study · Maturity Model · Operations

0 likes · 16 min read

Efficient Ops

Mar 2, 2022 · Operations

How Chinese Banks Accelerate Digital Transformation with DevOps Maturity Models

Amid a nationwide digital transformation push, Chinese banks such as Ningbo, Zhengzhou, Baixin, and others have used the DevOps Capability Maturity Model from the China Academy of Information and Communications Technology (CAICT) to assess and improve their IT efficiency, team integration, and continuous delivery practices, providing valuable industry insights.

Maturity Model · Operations · banking

0 likes · 15 min read

DevOps Cloud Academy

Mar 2, 2022 · Operations

Key DevOps Metrics for Effective Software Delivery

This article explains the most important DevOps metrics—such as deployment frequency, lead time, automated test pass rate, change failure rate, MTTR, and others—and how tracking them helps teams improve software delivery speed, quality, and operational efficiency.

Automation · Operations · devops

0 likes · 10 min read

Efficient Ops

Mar 1, 2022 · Operations

How Chinese Banks Boost IT Efficiency with the DevOps Maturity Model

This article outlines how major Chinese banks have adopted the CAICT‑led DevOps Capability Maturity Model, presenting assessment counts across state‑owned, joint‑stock, and city commercial banks, summarizing the model’s standards, evaluation domains, and providing contact details for further inquiries.

IT efficiency · Maturity Model · Operations

0 likes · 6 min read

Efficient Ops

Mar 1, 2022 · Operations

How China’s Leading Banks Master DevOps: Insights from the CAICT Maturity Model

This article reviews how major Chinese state‑owned banks applied the China Academy of Information and Communications Technology’s DevOps Capability Maturity Model, detailing assessment numbers, case studies of e‑life, AI advisory, mobile banking, and cloud‑native platforms, and highlighting the operational and security benefits achieved.

Maturity Model · Operations · banking

0 likes · 17 min read

MaGe Linux Operations

Feb 28, 2022 · Operations

109 Essential Shell Scripts to Automate Linux Operations – Free PDF

This article compiles 109 practical shell scripts covering a wide range of Linux automation tasks such as security, monitoring, backup, deployment, and system management, providing clear, copy‑ready code for hands‑on practice and interview preparation.

Automation · Linux · Operations

0 likes · 7 min read

Practical DevOps Architecture

Feb 28, 2022 · Operations

Resolving Nginx 502 Bad Gateway Errors with SSL Handshake Issues and Buffer Configuration

This article analyzes 502 Bad Gateway errors caused by SSL handshake failures in Nginx, presents the relevant error logs and curl output, and provides a detailed configuration example—including buffer sizes, client limits, and proxy settings—to fix the issue.

502 Bad Gateway · Operations · SSL handshake

0 likes · 3 min read

FunTester

Feb 27, 2022 · Operations

Performance Testing Articles Collection (Chinese Resources)

This collection compiles dozens of Chinese articles on performance testing, covering tools, frameworks, case studies, and techniques such as netdata monitoring, load generators, concurrency utilities, distributed testing, QPS modeling, and comparisons of JMeter, K6, Gatling, and FunTester.

Benchmarking · Operations · load testing

0 likes · 8 min read

Ops Development Stories

Feb 25, 2022 · Operations

Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ceph · Cluster Recovery · Operations

0 likes · 7 min read

IT Architects Alliance

Feb 23, 2022 · Operations

A Historical Overview of DevOps and Its Related Practices

This article traces the evolution of DevOps from its roots in Toyota’s Production System and early manufacturing practices through the emergence of Kanban, Waterfall, Scrum, Agile, Lean, and modern extensions like ChatOps, GitOps, FinOps, and AIOps, highlighting key milestones and concepts.

Agile · Kanban · Lean

0 likes · 10 min read

ITFLY8 Architecture Home

Feb 23, 2022 · Operations

Essential Project Management Charts Every Manager Should Use

This guide introduces the most effective project management charts—including Gantt, burn‑down, WBS, HOQ, RACI, matrix, PERT, mind‑map, decision‑tree, and status tables—explaining their purpose, key components, and how to create them with common tools like Excel, Visio, and Xmind.

Burn-down Chart · Gantt Chart · Operations

0 likes · 6 min read

HomeTech

Feb 23, 2022 · Operations

Construction and Future Planning of the Quality Assurance Technical System at Home

Facing rapid business growth and evolving mobile, AI, and automotive trends, the Home quality assurance team outlines its testing cloud platform’s three‑layer architecture, current capabilities such as performance, automation, and code scanning, the challenges it confronts, and its roadmap for expanding PaaS and mobile testing.

Automation testing · Operations · cloud testing

0 likes · 11 min read

IT Services Circle

Feb 23, 2022 · Operations

Setting Up Spring Boot Admin to Monitor Spring Boot Applications

This guide explains how to create a Spring Boot Admin server, configure a Spring Boot client to register with it, enable Actuator for extended metrics, and view real‑time logs, providing a comprehensive monitoring solution for Java backend services.

Operations · Spring Boot · actuator

0 likes · 9 min read

Architecture Digest

Feb 19, 2022 · Operations

Guide to Setting Up and Using the JVM Monitoring Tool with Spring Boot

This article provides a step‑by‑step tutorial for installing, configuring, and running a JVM monitoring solution that integrates with Spring Boot, covering repository cloning, server configuration, Maven installation, application property setup, and accessing the monitor server UI.

Git · JVM Monitoring · Maven

0 likes · 4 min read

Ctrip Technology

Feb 17, 2022 · Operations

Evolution and Architecture of the Hickwall Enterprise Monitoring Platform

The article details the background, challenges, multi‑year evolution, current architecture, and future roadmap of Hickwall, Ctrip's enterprise‑grade monitoring and observability platform, covering metrics, logs, traces, high‑cardinality handling, cloud‑native integration, alert governance, and storage engine migrations.

Alerting · Observability · Operations

0 likes · 15 min read

IT Architects Alliance

Feb 15, 2022 · Operations

What Real-World Performance Tuning Taught Us About Legacy Web Apps

After a traffic surge exposed severe latency in a 15-year-old multi-service web platform, we used monitoring to discover a DB-connection leak caused by a liveness probe, corrected it, and distilled four practical lessons on latency metrics, tooling, legacy maintenance, and code vigilance.

APM · Operations · Performance

0 likes · 9 min read

dbaplus Community

Feb 14, 2022 · Operations

Building a Robust Monitoring System for Securities Firms with Open‑Source Tools

This article explains why securities firms must adopt comprehensive, centralized monitoring, outlines regulatory and SLA drivers, identifies common monitoring shortcomings, and provides a step‑by‑step guide using open‑source solutions like Zabbix and Grafana to design, implement, evaluate, and continuously improve monitoring management.

Grafana · IT infrastructure · Operations

0 likes · 33 min read

IT Services Circle

Feb 12, 2022 · Operations

Elon Musk Unveils Latest Starship Updates and Ambitious Mars Plans

Elon Musk presented new Starship performance data, outlined a goal of up to 50 launches this year and three daily launches in the future, described the spacecraft’s dimensions, propulsion, heat shield and orbital refueling technology, and reiterated his long‑term vision of making humanity a multiplanet species by colonising Mars.

Aerospace · Mars · Operations

0 likes · 9 min read

Alibaba Terminal Technology

Feb 11, 2022 · Operations

How to Execute a Multi‑Phase IPv6 Migration for Large‑Scale Services

This guide outlines a comprehensive, three‑stage IPv6 migration roadmap—including network upgrades, DNS/HTTPDNS redesign, security hardening, cloud and CDN adaptation, and mobile/app adjustments—to achieve full IPv6‑only support across infrastructure, services, and end‑users while ensuring seamless performance and security.

Cloud · IPv6 · Network Migration

0 likes · 22 min read

Efficient Ops

Feb 10, 2022 · Operations

Why Did a Metaspace Misconfiguration Crash Our Elastic Cloud Service?

A production incident on an elastic‑cloud deployment revealed that setting the JVM Metaspace limit to 64 MiB, while the application required around 76 MiB, triggered continuous Full GC, causing stop‑the‑world pauses, timeouts across the entire call chain, and a costly rollback.

Elastic Cloud · GC · JVM

0 likes · 9 min read

Code Ape Tech Column

Feb 9, 2022 · Operations

Overview of Blue‑Green, Rolling, Canary, and A/B Testing Deployment Strategies

The article explains common release strategies—including blue‑green deployment, rolling updates, canary (gray) releases, and A/B testing—detailing their principles, advantages, limitations, and practical considerations for safely delivering new versions in production environments.

A/B testing · Blue-Green · Deployment

0 likes · 9 min read

Practical DevOps Architecture

Feb 8, 2022 · Operations

Extending Zabbix Monitoring with Custom Scripts and Handling Stale NFS Handles

This article explains how Zabbix monitoring can be extended with custom shell or Python scripts to gather business-specific metrics, demonstrates a sample script that checks disk usage, and provides three methods to resolve a stale NFS file handle error, including using fuser, process inspection, and forced unmount.

Custom Script · NFS · Operations

0 likes · 3 min read

Kujiale Project Management

Feb 8, 2022 · Operations

Mastering Large-Scale SaaS Project Delivery: CoolHome’s Proven Process

This article shares CoolHome’s comprehensive SaaS project delivery framework, detailing each phase—from initiation and planning to execution, monitoring, and closure—while highlighting common pitfalls, improvement measures, and practical tips for managing large‑client engagements effectively.

Operations · SaaS delivery · large client

0 likes · 13 min read

Efficient Ops

Feb 7, 2022 · Operations

How Xinwang Bank Overcame DevOps Hurdles to Pass a Level‑3 Continuous Delivery Assessment

In 2021, Xinwang Bank’s digital-native team tackled tight deadlines, tool migrations, personnel shifts, and intense debates to successfully achieve a Level‑3 DevOps continuous‑delivery assessment for its distributed consumer‑credit core system, demonstrating how coordinated effort and containerization can boost operational excellence.

Banking Technology · Operations · continuous delivery

0 likes · 9 min read

21CTO

Feb 7, 2022 · Operations

Why Every Line of Code Matters: Boosting Performance by 3000% with a Simple DB Fix

This article shares hard‑won lessons from optimizing fifteen high‑load web applications, highlighting how a tiny DB‑connection leak in a pod probe caused severe slowdown and how fixing it, along with proper load testing, monitoring, and investment in tools and people, can dramatically improve system performance.

APM · Operations · database connections

0 likes · 9 min read

Java Backend Technology

Feb 7, 2022 · Operations

Why Did the Internet Crash in 2021? 10 Major Outage Lessons

The article reviews ten significant 2021 internet outages—both domestic and international—analyzing their root causes, from server room power failures to configuration bugs, and highlights the operational lessons engineers can learn to improve system resilience.

Case Study · Cloud Computing · Operations

0 likes · 17 min read

dbaplus Community

Jan 29, 2022 · Operations

Accelerating Call Center Incident Recovery: Practical Fault Handling and Monitoring Strategies

This article walks through a real call‑center outage scenario, outlines step‑by‑step fault identification, emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent, automated event handling to help operations teams resolve incidents faster and more reliably.

Operations · call center · emergency plan

0 likes · 14 min read

Architect

Jan 25, 2022 · Databases

Designing a High‑Availability Redis Service with Sentinel

This article explains why Redis needs high availability, defines failure scenarios, compares several HA architectures—including single‑instance, master‑slave with one or multiple Sentinel processes, and a three‑node solution with a virtual IP—and provides practical guidance for building a reliable Redis service.

High Availability · Operations · Redis

0 likes · 12 min read

Qunar Tech Salon

Jan 24, 2022 · Databases

Qunar 2021 Technical Salon – Infrastructure Articles Collection (Databases, Operations, Components)

This article compiles the 2021 Qunar Technical Salon infrastructure series, presenting original technical writings on databases, operational practices, and core components, each linked to detailed posts that share real‑world experiences, design guidelines, and performance insights for engineers and practitioners.

Databases · Operations · Qunar

0 likes · 7 min read

Top Architect

Jan 21, 2022 · Operations

Clearing and Reconciliation System in Payment Platforms: Architecture, Processes, and Data Handling

The article provides a comprehensive overview of payment‑system clearing and reconciliation, detailing fund inflow/outflow matching, reconciliation center functions, various one‑to‑one, many‑to‑many and one‑to‑many matching rules, data import, error handling, and auxiliary modules such as balance entry and abnormal data recovery.

Financial · Operations · Reconciliation

0 likes · 29 min read

Ops Development Stories

Jan 21, 2022 · Operations

How to Combine ELK and Zabbix for Real‑Time Log Alerting

This guide explains how to integrate ELK's Logstash with Zabbix using the logstash‑output‑zabbix plugin, covering installation, configuration of Logstash pipelines, Zabbix template and trigger setup, and testing the end‑to‑end alerting workflow.

Alerting · ELK · Log Monitoring

0 likes · 17 min read

58UXD

Jan 20, 2022 · Operations

How to Redesign Offline Recruitment Stores: Key Moments and Service Strategies

This article analyzes the challenges of offline recruitment stores, explores why job seekers drop out or feel dissatisfied, presents user research from Chongqing, identifies critical service moments, and proposes concrete design strategies to improve the hiring experience and operational efficiency.

Operations · User Research · offline recruitment

0 likes · 12 min read

Alibaba Cloud Developer

Jan 19, 2022 · Operations

Master Real-Time Logging and Debugging for Serverless Spring Boot Apps

This article explains how to monitor, collect, and analyze real‑time logs, use multi‑dimensional metrics, perform local debugging, and enable cloud‑edge debugging for Spring Boot applications running on a Serverless platform, providing step‑by‑step commands and visual guides.

Logging · Operations · Serverless

0 likes · 10 min read

MaGe Linux Operations

Jan 14, 2022 · Operations

Choosing the Right Open‑Source Monitoring Tool: History, Pros, Cons & Use Cases

This comprehensive guide traces the evolution of open‑source monitoring solutions from the early 2000s to modern cloud‑native tools, comparing their strengths, weaknesses, and ideal deployment scenarios to help IT professionals select the most suitable monitoring product for their infrastructure.

Operations · Performance · cloud-native

0 likes · 14 min read

Beike Product & Technology

Jan 14, 2022 · Operations

Understanding Black Production in Real Estate: A Case Study of Beike

This article analyzes the structure of the black‑market industry ("black production") in real estate, focusing on its impact on companies like Beike and the countermeasures deployed against it.

Black Production · Industry Analysis · Operations

0 likes · 9 min read

IT Xianyu

Jan 14, 2022 · Operations

Redis Monitoring, Data Migration, and Cluster Management Tools Overview

This article introduces essential Redis operational tools, covering the INFO command for monitoring, Prometheus‑based redis‑exporter visualization, the Redis‑shake data migration utility, Redis‑full‑check consistency verification, and the CacheCloud platform for comprehensive cluster management.

CacheCloud · Data Migration · Operations

0 likes · 10 min read

Aikesheng Open Source Community

Jan 13, 2022 · Databases

Best Practices for MySQL Database Inspection: Methods, Classification, and Deep Inspection

The article shares a comprehensive methodology for MySQL database inspection, classifying checks by method, time and depth, and detailing availability, reliability, and performance best‑practice items along with concrete command examples and configuration recommendations.

Best Practices · Database Inspection · MySQL

0 likes · 15 min read

Programmer DD

Jan 11, 2022 · Operations

Building a TB‑Scale Log Monitoring System with ELK Stack and Kafka Streams

This article explains how to design and implement a terabyte‑level log monitoring platform using ELK Stack, FileBeat, Elastic APM, Kafka Streams, Prometheus, and Grafana, covering data collection, filtering, visualization, and resource‑efficient processing for large‑scale microservice environments.

ELK · Grafana · Log Monitoring

0 likes · 9 min read

Architecture Digest

Jan 10, 2022 · Operations

Comprehensive Guide to Deploying Filebeat and Graylog for Centralized Log Collection

This article explains how to use Filebeat and Graylog together for centralized log collection, covering Filebeat’s role, configuration files, input modules, Graylog’s architecture, pipeline rules, and step‑by‑step deployment using Docker and docker‑compose, providing practical commands and examples for operational environments.

Docker · Elasticsearch · Graylog

0 likes · 14 min read

Open Source Linux

Jan 10, 2022 · Operations

Where Does Linux Store Its System Logs? A Complete /var/log Guide

This article enumerates the most common Linux system log files under /var/log, explains the purpose of each log—including messages, dmesg, auth, boot, daemon, and service-specific logs—and lists key subdirectories for web, mail, audit, and other services.

Linux · Log Files · Operations

0 likes · 5 min read

Top Architect

Jan 7, 2022 · Operations

Technical Analysis of the Xi'an Health Code System Crash and Its Performance Bottlenecks

The article examines the repeated failures of the Xi'an health‑code platform, explaining that the root cause lies in serving all static assets (JS, CSS, images) from a single endpoint without a CDN, which under a peak load of 33 000 requests saturates the network bandwidth and crashes the service.

Backend Performance · CDN · Network Bandwidth

0 likes · 5 min read

58UXD

Jan 7, 2022 · Operations

How 3D World‑Building Transforms Home‑Service Operations at 58 Daojia

By analyzing user demographics, business model, and marketing needs, the 58 Daojia case study shows how constructing a cohesive 3D‑driven worldview—through gene‑family characters, mood‑board visual systems, and story‑centric design—enhances operational campaigns, boosts brand immersion, and streamlines visual asset reuse.

3D design · O2O · Operations

0 likes · 7 min read

Practical DevOps Architecture

Jan 5, 2022 · Operations

Deploying Prometheus and Node Exporter on a Linux Host

This guide walks through installing Prometheus and Node Exporter on a Linux server, copying binaries to system paths, configuring Prometheus with scrape jobs for the local node and remote hosts, and running the exporters with specific collector options for system metrics.

Operations · Prometheus · monitoring

0 likes · 4 min read

Tencent Qidian Tech Team

Jan 5, 2022 · Operations

Building a Real‑Time Log Tracker for Phone SDKs Using Cloud‑Native Design

This article describes the design and implementation of a comprehensive log tracking system for a phone SDK, covering client‑side logging, colored classification, plugin mechanisms, cloud‑native architecture, serverless functions, Elasticsearch storage, and real‑time visual debugging to enable rapid issue identification and resolution.

Operations · Serverless · cloud-native

0 likes · 18 min read

Efficient Ops

Jan 4, 2022 · Operations

How China’s Leading Enterprises Are Shaping the New DevOps Efficiency Measurement Standard

The article explains how a collaborative effort by CAICT and more than 30 leading tech and financial companies created the most comprehensive DevOps efficiency measurement model, detailing its components, evaluation results, and upcoming assessment enrollment for enterprises seeking to boost software development performance.

Operations · devops · digital transformation

0 likes · 6 min read

Architect's Tech Stack

Jan 3, 2022 · Operations

Overview of Redis Monitoring, Data Migration, and Cluster Management Tools

This article introduces essential Redis operational tools, covering real‑time monitoring with the INFO command and exporters, data migration using Redis‑shake, consistency checking via Redis‑full‑check, and cluster management through CacheCloud, while highlighting key INFO sections such as stats, commandstats, cpu, and memory.

CacheCloud · Operations · Prometheus

0 likes · 10 min read

Top Architect

Jan 3, 2022 · Operations

Gray Release (Canary Deployment) Design and Implementation Guide

This article explains the concept of gray release, outlines a simple architecture with essential components, describes common traffic-splitting strategies, shows how to implement control in Nginx and service layers, and discusses complex scenarios such as multi‑service and data‑centric deployments.

A/B testing · Canary Deployment · Deployment Strategy

0 likes · 7 min read

Continuous Delivery 2.0

Dec 31, 2021 · Operations

Curated Reading List on DevOps, Software Delivery Performance, and Engineering Productivity

This article presents a concise collection of ten Chinese-language resources that summarize the 2021 DORA DevOps report, the importance of consistency in R&D, fundamental efficiency principles, Microsoft’s testing shift, Google’s release and productivity metrics, and SRE health measurements, offering valuable insights for modern software engineering teams.

Engineering Productivity · Operations · SRE

0 likes · 5 min read

Alibaba Cloud Native

Dec 30, 2021 · Operations

How to Implement Chaos Engineering for Cloud‑Native Applications: A Step‑by‑Step Guide

This article explains how cloud‑native teams can adopt chaos engineering—defining its concepts, outlining its unique characteristics, and detailing a four‑stage implementation process from manual drills to unannounced production‑level drills, with practical steps, environment setups, and real‑world results.

Fault Injection · Kubernetes · Operations

0 likes · 14 min read

HomeTech

Dec 30, 2021 · Operations

Open-Falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the Open‑Falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

Open-Falcon · Operations · monitoring

0 likes · 11 min read
