Tagged articles
3281 articles
Efficient Ops
Oct 12, 2022 · Operations

How Chinese Banks Accelerate IT Efficiency with DevOps Maturity Assessments

This article reviews how leading Chinese banks and financial institutions have adopted the CAICT DevOps Capability Maturity Model, detailing their assessment results across continuous delivery, technical operations, security, and tooling standards, and highlighting the operational benefits achieved.

Banking · Continuous Delivery · DevOps
0 likes · 16 min read
Efficient Ops
Oct 12, 2022 · Operations

How China’s State Banks Achieved Top DevOps Maturity: Real‑World Case Studies

This article reviews how major Chinese state‑owned banks applied the DevOps Capability Maturity Model from the China Academy of Information and Communications Technology (CAICT), detailing assessment results, project implementations, and performance improvements across continuous delivery, security, and system tooling, and offering insights for enterprises pursuing DevOps transformation.

Banking · Continuous Delivery · DevOps
0 likes · 18 min read
dbaplus Community
Oct 8, 2022 · Operations

Designing High‑Availability Internet Architecture: Redundancy and Automatic Failover

This article explains how to achieve high availability in internet systems by layering architecture, using redundancy and automatic failover across access, proxy, microservice, middleware, and storage components, and discusses practical techniques, common pitfalls, and operational safeguards for resilient services.

Microservices · Operations · automatic failover
0 likes · 19 min read
DevOps Cloud Academy
Oct 5, 2022 · Operations

Deming's Fourteen Points of Quality Management

The article outlines Deming's fourteen fundamental principles for quality management, emphasizing a permanent purpose of improvement, a new philosophy, eliminating reliance on inspection, fostering continuous improvement, modern training and supervision, breaking departmental barriers, and establishing top‑level leadership to drive ongoing innovation.

Continuous Improvement · Deming · Leadership
0 likes · 7 min read
Liangxu Linux
Oct 2, 2022 · Operations

Essential Linux Ops Practices: Prevent Disasters and Boost Stability

Drawing from three and a half years of Linux operations, this guide outlines practical standards for testing, confirming commands, avoiding concurrent edits, mandatory backups, data safety, security hardening, continuous monitoring, performance tuning, and the right mindset to keep production environments stable and secure.

Backup · Linux · Operations
0 likes · 12 min read
Architects Research Society
Sep 28, 2022 · Operations

The 13 Most Difficult IT Roles to Fill in 2021: Insights from the CIO Survey

The 2021 CIO Survey reveals that organizations worldwide are struggling to fill cybersecurity, artificial intelligence, and data analytics positions, with remote work expanding the talent pool but still leaving critical roles hard to staff, highlighting the need for strategic prioritization and new hiring approaches.

AI Recruitment · CIO survey · Cloud Services
0 likes · 14 min read
dbaplus Community
Sep 27, 2022 · Operations

How to Build a Scalable Rate‑Limiting System with Kong in Cloud‑Native Operations

This article outlines a comprehensive, cloud‑native rate‑limiting solution using Kong gateway, covering background challenges, design considerations, multi‑layer architecture, plugin development, CI/CD workflow, deployment strategies, and operational best practices to achieve low cost, high efficiency, and high quality across diverse projects.

Kong · Microservices · Operations
0 likes · 24 min read
Aikesheng Open Source Community
Sep 27, 2022 · Operations

Refactoring Alertmanager: Reducing Noise, Improving Escalation, Suppression, and Silence Management

This article shares practical experiences and solutions for improving an Alertmanager‑based alert system, addressing problems such as noisy alerts, lack of escalation, missing recovery notifications, suppression limitations, and cumbersome silence management by redesigning architecture, adding custom scripts, and extending database support.

Alerting · Alertmanager · Operations
0 likes · 19 min read
DevOps Cloud Academy
Sep 26, 2022 · Operations

Using Jenkins Deploy Dashboard Plugin for Visual Deployment Management

This article explains how to install and configure the Deploy Dashboard plugin in Jenkins to visualize deployment versions across environments, add deployment information via pipeline code, create custom dashboard views, and add quick‑deploy buttons for streamlined CI/CD operations.

Deploy Dashboard · DevOps · Jenkins
0 likes · 5 min read
NetEase Yanxuan Technology Product Team
Sep 26, 2022 · Operations

How to Tame Alert Storms: Building a Systematic Monitoring and Alerting Framework for Microservices

This article analyzes the challenges of alert overload in large‑scale microservice environments and presents a systematic approach—including timeliness metrics, a maturity model, lifecycle tracking, feedback loops, downgrade mechanisms, and cross‑service aggregation—to improve alert effectiveness and reduce noise.

Alert Management · MTTR · Microservices
0 likes · 16 min read
dbaplus Community
Sep 25, 2022 · Operations

How to Achieve Zero‑Downtime Application Deployments with Spring Boot and Eureka

This article explains why zero‑downtime releases are essential for modern services, defines three maturity levels, compares common release patterns, outlines the required technical components, and provides step‑by‑step Spring Boot/Eureka procedures—including configuration and graceful‑shutdown scripts—to keep applications available during deployment.

Deployment · Operations · Zero Downtime
0 likes · 20 min read
FunTester
Sep 25, 2022 · Databases

Data Migration Scenarios, Testing, and Acceptance Guidelines

This article outlines common data migration scenarios such as system consolidation and database sharding, details analysis of user data before migration, discusses conflict resolution rules, presents migration planning and acceptance testing steps, and highlights post‑release monitoring and user feedback handling.

Data Migration · Operations · acceptance
0 likes · 7 min read
Code Ape Tech Column
Sep 24, 2022 · Operations

Overview of Redis Monitoring, Data Migration, and Cluster Management Tools

This article introduces essential Redis operational tools, covering real‑time monitoring with the INFO command and a Prometheus exporter, data migration using redis‑shake, consistency checking via redis‑full‑check, and cluster management through CacheCloud, providing practical guidance for administrators.

Cluster Management · Data Migration · Operations
0 likes · 10 min read
58UXD
Sep 22, 2022 · Product Management

How 58 Recruitment Built a Unified Brand Experience with Design Standardization

This article explains how 58 Recruitment’s design team created a consistent brand feel across multiple online hiring events by using user insights, clear positioning, standardized visual principles, and a modular template system that speeds up design, development, and deployment while enhancing user perception.

Operations · UX · brand design
0 likes · 10 min read
Huolala Tech
Sep 22, 2022 · Operations

How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud

This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.

Operations · bytecode · cloud
0 likes · 24 min read
Efficient Ops
Sep 18, 2022 · Operations

Speed Up Sysadmin Tasks: Fast File Deletion, iSCSI Detection, and Group Management

This article shares practical Linux and vSphere techniques—including using rsync for rapid bulk deletions, scanning SCSI devices without reboot, safeguarding rm with shell parameter expansion, mounting remote filesystems via sshfs, and managing user groups with gpasswd—to boost everyday operations efficiency.

Operations · Shell · Sysadmin
0 likes · 11 min read
Architects Research Society
Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability
0 likes · 18 min read
DevOps Engineer
Sep 13, 2022 · Operations

DevOps Learning Roadmap 2022 by Vrashabh Sontakke

This article presents a comprehensive 2022 DevOps learning roadmap compiled by engineer Vrashabh Sontakke, providing downloadable images and links to detailed resources that outline the essential tools, practices, and knowledge areas for aspiring DevOps professionals.

2022 · DevOps · Operations
0 likes · 2 min read
NetEase Yanxuan Technology Product Team
Sep 13, 2022 · Operations

How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

Alerting · Full‑Link Tracing · Microservices
0 likes · 23 min read
Huolala Tech
Sep 8, 2022 · Databases

Why Build Your Own Database Middleware in the Multi‑Cloud Era?

The article explains why, contrary to common belief, the rise of multi‑cloud environments actually demands self‑built database middleware to ensure seamless adaptation, vendor neutrality, high availability, and cost‑effective scalability for growing enterprise workloads.

Database Middleware · Operations · Scalability
0 likes · 18 min read
Continuous Delivery 2.0
Sep 7, 2022 · Operations

Deming's Fourteen Points of Quality Management

The article outlines Deming's fourteen fundamental principles for quality management, emphasizing long‑term product and service improvement, statistical control, continuous process enhancement, employee empowerment, cross‑department collaboration, and the establishment of a high‑level management structure that drives perpetual innovation.

Continuous Improvement · Deming · Operations
0 likes · 6 min read
Dada Group Technology
Sep 5, 2022 · Operations

Design and Implementation of JD.com Data Construction Platform for Testing Efficiency

This article describes the motivation, design, architecture, key features, and outcomes of JD.com's data construction platform, which automates test data creation using a Spring Boot, MyBatis, and Vue stack, significantly reducing manual effort and improving testing efficiency across multiple business lines.

Operations · Testing Automation · data construction
0 likes · 9 min read
Open Source Linux
Sep 1, 2022 · Operations

What’s New in Zabbix 6.0? Enhanced Monitoring, HA, AI & Cloud Features Explained

Zabbix 6.0 introduces a suite of enhancements—including high‑availability clustering, advanced business‑service monitoring with SLA calculations, root‑cause analysis, machine‑learning‑based anomaly detection, Kubernetes templates, a redesigned audit log, TLS certificate checks, UI improvements, customizable branding, and new integrations—aimed at boosting operational visibility and efficiency across cloud and on‑premise environments.

Kubernetes · Operations · Zabbix
0 likes · 12 min read
dbaplus Community
Sep 1, 2022 · Operations

How Vivo’s Server‑Side Monitoring Evolved: Architecture, Data Flow, and Alert Strategies

This article provides a comprehensive overview of Vivo's server‑side monitoring system, detailing its architecture evolution, data collection pipelines, OpenTSDB storage design, alerting mechanisms, and comparisons with other mainstream monitoring solutions, offering practical guidance for technology selection and implementation.

OpenTSDB · Operations · System Architecture
0 likes · 18 min read
Liangxu Linux
Aug 31, 2022 · Operations

Why TIME_WAIT Connections Accumulate and How to Fix Them

In high‑concurrency scenarios, large numbers of TCP connections in the TIME_WAIT state can exhaust local ports and cause new connections to fail; understanding TCP's four‑way connection teardown, adjusting socket reuse settings, and using keep‑alive connections mitigates the issue.

Linux · Networking · Operations
0 likes · 8 min read
Efficient Ops
Aug 30, 2022 · Operations

How ICBC Standardized Continuous Delivery to Supercharge DevOps Efficiency

This article details Industrial and Commercial Bank of China's journey to standardize continuous delivery, outlining the background challenges, the definition of release units, the construction of a standardized toolchain, implementation results, and future plans to enhance DevOps performance across the enterprise.

Continuous Delivery · DevOps · Operations
0 likes · 9 min read
DataFunSummit
Aug 30, 2022 · Operations

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big‑data computing services, detailing challenges such as heterogeneous data, sample imbalance, and real‑time constraints, and describing the multi‑stage data processing, hierarchical Bayesian modeling, and deployment results that reduce MTTR by 20%.

Big Data · Operations · Root Cause Analysis
0 likes · 16 min read
Zuoyebang Tech Team
Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Infrastructure · Operations · SRE
0 likes · 20 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Operations · Reliability
0 likes · 12 min read
ITPUB
Aug 20, 2022 · Operations

How Meituan Scaled Its CI/CD Pipeline Engine to 100k Daily Jobs with 99.99% Success

This article details Meituan's three‑year journey building a self‑developed pipeline engine that now handles nearly 100,000 daily executions with over 99.99% reliability, covering background, challenges, architectural decisions, core scheduling and resource‑pool designs, component layering, and future cloud‑native plans.

Job Scheduling · Operations · Pipeline
0 likes · 25 min read
Software Development Quality
Aug 19, 2022 · Operations

Comprehensive Quality Management SLA Framework for IT Services

This document outlines a detailed Service Level Agreement (SLA) framework covering quality service standards, management processes, testing capabilities, tool support, resource management, measurement systems, risk handling, and continuous improvement to ensure consistent delivery and customer satisfaction across IT operations.

Operations · SLA · Training
0 likes · 17 min read
Cloud Native Technology Community
Aug 18, 2022 · Operations

Understanding DevOps: Integrating Development and Operations Beyond the ‘Who Develops Who Operates’ Myth

The article clarifies common misconceptions about DevOps, explains that true development‑operations integration relies on dedicated ops teams, automation tools, standardized delivery artifacts, and unified permission management rather than developers performing ops tasks, and highlights Google SRE practices as a practical guide.

DevOps · Infrastructure · Operations
0 likes · 10 min read
Qunar Tech Salon
Aug 17, 2022 · Operations

Design and Optimization of Testing Environment 3.0 at Qunar Travel

This article describes how Qunar Travel has evolved its testing environment governance from a fixed 10‑machine setup to a template‑driven, soft‑routing architecture (Environment 3.0), improving delivery speed, reliability, business connectivity, and reducing operational costs through automated sync, smart recommendations, and continuous business checks.

Environment · Microservices · Operations
0 likes · 22 min read
DevOps
Aug 17, 2022 · Operations

Measuring Success in Continuous Delivery: Four Key Metrics and Practical Tips

This article explains why measuring is essential for continuous delivery, introduces four valuable metrics—deployable package count, cycle time, mean time between failures, and mean time to recovery—and offers practical tips to improve delivery speed and reliability.

Continuous Delivery · DevOps · MTBF
0 likes · 7 min read
Huolala Tech
Aug 11, 2022 · Operations

How Huolala Built an AI‑Powered Intelligent Monitoring Platform at Scale

This article details Huolala's journey from basic monitoring to an AI‑driven intelligent observability platform, covering AIOps concepts, a comprehensive monitoring framework, practical implementations, automated alert analysis, lessons learned, and future directions for large‑scale operations.

DevOps · Huolala · Operations
0 likes · 18 min read
Efficient Ops
Aug 9, 2022 · Operations

How ICBC Accelerated Digital Transformation with XOps: From DevOps to MLOps

ICBC’s software development center outlines its multi‑year journey adopting XOps practices—DevOps, DevSecOps, DataOps, MLOps, AIOps, ChatOps and BizDevOps—to boost development efficiency, enhance security, accelerate data‑driven AI, and cut costs, showcasing measurable improvements in release frequency, defect rates, and operational automation.

DataOps · DevOps · DigitalTransformation
0 likes · 13 min read
Architecture Digest
Aug 8, 2022 · Operations

Log Shrinking Techniques and Case Study for Reducing Log Size

This article explains why oversized logs hurt system performance, presents three practical log‑shrinking strategies—printing only necessary logs, merging duplicate entries, and simplifying content—illustrates them with Java code snippets, and evaluates their impact through a real‑world case that cuts daily log volume from 5 GB to under 1 GB.

Backend · Operations · log optimization
0 likes · 7 min read
DevOps Cloud Academy
Aug 7, 2022 · Operations

Key Capabilities for Continuous Delivery and DevOps Success

The article outlines twenty‑four essential capabilities—spanning continuous delivery, architecture, product and process, lean management, and culture—that research shows drive high performance in software delivery and organizational outcomes.

Culture · DevOps · Lean Management
0 likes · 10 min read
Liangxu Linux
Aug 6, 2022 · Operations

When Core Switches Suddenly Die: The Hidden SSD Time‑Bomb in Network Gear

A network engineer recounts a terrifying outage caused by a firmware‑related SSD bug that locks core switches after 28,224 hours of use, explains the emergency troubleshooting steps taken, and highlights the need for better vendor recall mechanisms to protect critical infrastructure.

Hardware Reliability · Operations · SSD bug
0 likes · 8 min read
Python Crawling & Data Mining
Aug 6, 2022 · Operations

Why Operations Data Quality Is the Key to Successful Digital Transformation

In the era of big data, poor operations data quality undermines analytics, decision‑making and digital transformation, so organizations must adopt a three‑dimensional governance approach—covering organization, processes and technology—to ensure completeness, consistency, accuracy, uniqueness, relevance and timeliness of their operational data.

Analytics · Data Governance · Data Quality
0 likes · 17 min read
Ops Development Stories
Aug 6, 2022 · Cloud Native

8 Proven Strategies to Beat Alert Fatigue in Kubernetes

This article explains why alert fatigue harms on‑call teams in Kubernetes environments and offers eight practical techniques—ranging from metric definition to alert suppression—to reduce noise, improve response efficiency, and protect team well‑being.

Kubernetes · Operations · alert fatigue
0 likes · 8 min read
MaGe Linux Operations
Aug 3, 2022 · Operations

Record Heat Waves Cripple Global Data Centers and Cloud Services

Extreme summer temperatures across the globe have forced major cloud providers like Google and Oracle to shut down servers, melted runways in the UK, triggered heat‑related health crises, and highlighted the vulnerability of data‑center cooling systems to unprecedented heat waves.

Operations · cloud computing · data center
0 likes · 7 min read
DataFunSummit
Aug 2, 2022 · Big Data

Tencent PCG Real‑Time Data Warehouse and Operations Architecture Overview

This article presents Tencent's PCG data platform evolution, detailing the challenges of integrating multiple business groups, the design of a unified big‑data architecture, real‑time and batch processing pipelines, MQ and ATTA systems, and comprehensive operational practices for reliability and scalability.

ATTA · Big Data · MQ
0 likes · 17 min read
Top Architect
Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Operations · emergency response · fault handling
0 likes · 15 min read
Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Operations
0 likes · 21 min read
Architects Research Society
Aug 1, 2022 · Operations

Enterprise Automation as a Strategic Initiative: Framework, Value, and the Role of Ansible Automation Platform

The article explains how enterprise automation should be treated as a strategic business initiative, outlines the value drivers such as quality, agility, compliance and cost savings, and presents a framework—including Ansible Automation Platform—to integrate and coordinate diverse automation projects across hybrid cloud environments.

Ansible · Enterprise Automation · Operations
0 likes · 10 min read
DevOps Cloud Academy
Aug 1, 2022 · Operations

Future DevOps Trends Since 2022: Practices, Case Studies, and Impact

This article examines post‑2022 DevOps trends—including GitOps, AIOps/MLOps, DevSecOps, FinOps, DataOps, chaos engineering, SRE, hybrid deployment, automation, IaC, serverless, cloud‑native, microservices, containerization, Kubernetes, edge computing, data observability, and platform engineering—illustrating each with real‑world case studies that show measurable improvements in speed, reliability, cost, and scalability.

DevOps · Kubernetes · Microservices
0 likes · 20 min read
Efficient Ops
Jul 28, 2022 · Operations

How Zhongyuan Bank Achieved Industry-Leading DevOps Efficiency with a Unified Measurement Model

On July 28, 2022, China Academy of Information and Communications Technology announced that Zhongyuan Bank's R&D Efficiency Insight Platform passed the DevOps General Efficiency Measurement Model assessment at the industry promotion level, highlighting the bank's leading DevOps practices, detailed interview insights, and future development plans.

Banking Technology · Cloud Native · DevOps
0 likes · 12 min read
Efficient Ops
Jul 28, 2022 · Operations

How Zhejiang Commercial Bank Achieved Industry-Leading DevOps Maturity

Zhejiang Commercial Bank’s Financial Technology team discusses how their industry‑first DevOps continuous‑delivery platform earned a Level‑3 assessment from the China Academy of Information and Communications Technology, highlighting the standards, tools, metrics, challenges, and future plans that propelled their supply‑chain finance service platform to a leading position in digital transformation.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 13 min read
dbaplus Community
Jul 25, 2022 · Operations

How to Build an Enterprise‑Grade Monitoring & Alerting Platform with Prometheus, Grafana, and AlertManager

This article describes how to design a unified, enterprise‑level monitoring and alerting platform: it analyzes the shortcomings of standard Prometheus‑Grafana‑AlertManager setups, outlines platform‑vs‑business responsibilities, and details architecture, user‑scenario requirements, component selection, high‑availability strategies, and deployment models to achieve scalable, easy‑to‑use observability.

Operations · high-availability · monitoring
0 likes · 12 min read
Tencent Cloud Middleware
Jul 25, 2022 · Operations

How Tencent Scaled Apache Pulsar to Process Billions of Messages Daily – Ops Lessons & Tuning Guide

This article details the design, deployment, and performance‑tuning techniques used by Tencent's TEG Data Platform to operate a massive Apache Pulsar cluster handling trillions of messages per day, covering server‑side parameters, client‑side configurations, common bottlenecks, and step‑by‑step troubleshooting advice.

Apache Pulsar · Go SDK · Large Scale Messaging
0 likes · 22 min read
Ops Development Stories
Jul 25, 2022 · Operations

Why Did My Windows Server Run Out of Ports? A Real‑World Debugging Tale

After a birthday outing, a payment failure revealed a Windows server’s inability to connect to a bank, leading the author through CPU/memory checks, network diagnostics, port exhaustion discovery, registry tweaks, malicious process removal, and final remediation, highlighting essential operations lessons.

Operations · TCP ports · Windows server
0 likes · 6 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve high‑availability, and build confidence in production by defining stable states, injecting realistic failures, and measuring impact through observability metrics, with practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Operations
0 likes · 11 min read
Programmer DD
Jul 22, 2022 · Operations

How to Shrink Log Files: Cut 5GB Daily Logs to Under 1GB with Proven Techniques

This article explains practical methods for reducing oversized log files—such as printing only essential logs, merging entries, and simplifying messages—illustrated with code examples and a real‑world case study that lowered daily log volume from 5 GB to under 1 GB while preserving debugging capability.

Operations · log optimization
0 likes · 8 min read
MaGe Linux Operations
Jul 20, 2022 · Operations

What a Typical Ops Day Looks Like—and How to Make It More Productive

The author recounts a chaotic typical day for Chinese operations engineers, then proposes a balanced schedule that prioritizes urgent firefighting tasks while dedicating most time to proactive monitoring, performance tuning, tool development, and continuous learning for long‑term system stability.

DevOps · Operations · monitoring
0 likes · 4 min read
Huawei Cloud Developer Alliance
Jul 19, 2022 · Cloud Native

Unlock SaaS Success: Cloud‑Native Multi‑Tenant Architecture Essentials

This article recaps Huawei Cloud's DTT tech talk on SaaS cloud‑native applications, covering key challenges such as infrastructure selection, system and data‑storage design, multi‑tenant routing, various tenant isolation models, cloud‑native services like CCE and CSE, and the comprehensive SaaS support plan for developers.

Huawei Cloud · Operations · SaaS
0 likes · 14 min read
IT Architects Alliance
Jul 18, 2022 · Operations

Comparison of Prometheus and Zabbix Monitoring Solutions

This article compares Prometheus and Zabbix, outlining their histories, architectures, storage models, configuration complexity, community activity, and suitability for different environments, and concludes with recommendations on when to choose each monitoring system.

Comparison · Operations · Prometheus
0 likes · 9 min read
Big Data Technology & Architecture
Jul 18, 2022 · Big Data

Systematic Data Governance Practices in Meituan Accommodation Business

This article details how the data governance team of Meituan's accommodation business evolved toward an automated, systematic, and standardized governance framework. It covers the background challenges, the conception of a comprehensive governance system, its practical implementation across standardization, digitization, and systematization, and the resulting operational benefits and future directions.

Operations · automation · metadata
0 likes · 30 min read
21CTO
Jul 15, 2022 · Operations

Why Python Is the Top Language for DevOps Engineers

The article explains how DevOps relies on automation tools like Docker and Jenkins, argues that Python’s ease of use, versatility, and automation capabilities make it the optimal programming language for DevOps professionals, and advises continuous learning for operations staff transitioning into DevOps roles.

Operations · Python · automation
0 likes · 5 min read
Software Development Quality
Jul 14, 2022 · Operations

Mastering Full‑Link Stress Testing and Stability Assurance for Large‑Scale Promotions

This guide details a comprehensive approach to stability assurance and test innovation, covering full‑link stress testing, functional pre‑runs, loss‑prevention, fault drills, efficiency innovation from zero to one, and systematic quality assurance thinking for large‑scale promotional events.

Operations · efficiency innovation · fault drills
0 likes · 14 min read
Big Data Technology Architecture
Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, caused by a Lua‑induced bug that drove CPU usage to 100% in the OpenResty‑based SLB. It describes the incident timeline, root‑cause analysis, mitigation steps, and the subsequent technical and process improvements made to enhance reliability and multi‑active deployment.

Incident · Load Balancer · Lua
0 likes · 16 min read
Architect's Guide
Jul 14, 2022 · Cloud Native

Understanding Kubernetes: Cloud‑Native Architecture, Deployment Scenarios, and Operational Practices

This article explains what Kubernetes (K8s) is, its core characteristics of portability, extensibility, and automation, and how it enables cloud‑native, container‑based deployments, resource optimization, seamless service migration, and operational automation across enterprises of all sizes.

Cloud Native · Containerization · Kubernetes
0 likes · 10 min read
Cloud Native Technology Community
Jul 12, 2022 · Cloud Native

How Tencent Cut Kubernetes CPU Costs by 70%: A Full‑Scale Cloud‑Native Optimization Journey

This article presents a comprehensive, data‑driven case study of how Tencent’s internal Kubernetes/TKE platform reduced monthly CPU usage by up to 70% and memory usage by 50% through systematic cost data collection, VPA/HPA enhancements, custom scheduling, node‑level over‑commit, and safe node decommissioning, while maintaining zero‑incident reliability.

Cloud Native · Cost Optimization · HPA
0 likes · 28 min read
Top Architect
Jul 10, 2022 · Operations

Fundamental Principles and Implementation of a Payment System Reconciliation Center

This article explains how a payment company's reconciliation center matches internal transaction records with bank clearing data, detailing the clearing reconciliation system, reconciliation definitions, data sources, processing logic for one‑to‑one, many‑to‑many, and one‑to‑many scenarios, as well as module functions, exception handling, and data recovery procedures.

Operations · Reconciliation · accounting
0 likes · 28 min read
Software Development Quality
Jul 9, 2022 · Operations

Boost Deployment Efficiency with Structured Environment Management

This guide outlines how to classify, configure, and automate development, integration, UAT, pre‑production, and production environments, establishing principles, standards, recommended practices, common operational steps, and key metrics to improve deployment efficiency and maintain security and compliance.

Deployment · Infrastructure as Code · Operations
0 likes · 7 min read
Selected Java Interview Questions
Jul 6, 2022 · Operations

Grafana 9.0 New Features and Improvements Overview

Grafana 9.0 introduces a suite of usability enhancements—including a visual Prometheus query builder, a visual Loki LogQL generator, improved Explore‑to‑dashboard workflow, revamped heatmap panel, command palette, panel search, trace panel, navigation upgrades, and alerting refinements—aimed at simplifying observability, data visualization, and operational efficiency.

Alerting · Dashboard · Grafana
0 likes · 7 min read
Architects Research Society
Jul 5, 2022 · Backend Development

Handling Timeouts in Microservices: Strategies and Best Practices

This article explains why timeouts are inevitable in distributed systems, illustrates the challenges they create, and presents five practical strategies—including using defaults, safe retries, status checks, and user‑focused fallback—to manage slow or failed API calls in microservice architectures.

Backend · Operations · retry strategies
0 likes · 17 min read
dbaplus Community
Jul 4, 2022 · Operations

Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

Operations · SRE · monitoring
0 likes · 12 min read
DevOps Cloud Academy
Jul 2, 2022 · Operations

Understanding AWS-Enabled DevOps Practices and Key CI/CD Tools

This article explains the DevOps methodology in the context of AWS, defines its relationship to cloud operations, and introduces three core AWS CI/CD services—CodeCommit, CodePipeline, and CodeBuild—detailing their features and benefits for accelerating software delivery.

AWS · CodeBuild · CodeCommit
0 likes · 6 min read
Model Perspective
Jul 2, 2022 · Operations

Top Resources for Evaluation & Optimization Models – A Curated Guide

This article compiles and categorizes recent model‑related publications, offering a comprehensive list of evaluation‑model resources—including concepts, preprocessing techniques, weighting methods, and various algorithms—and optimization‑model references covering linear and integer programming, graph theory, network flows, and meta‑heuristics.

Linear Programming · Modeling · Operations
0 likes · 4 min read
Architects Research Society
Jul 2, 2022 · Operations

Reliability vs Resilience: Understanding the Difference and Its Importance

Reliability and resilience are distinct yet complementary goals for cloud services; reliability is the outcome of consistently meeting performance expectations, while resilience describes a system’s ability to continue operating despite failures, and this article introduces the concepts and outlines a four‑part series exploring related threats and enhancement techniques.

Cloud Services · Operations · Reliability
0 likes · 6 min read
Architecture Digest
Jul 2, 2022 · Operations

Design and Evolution of Vivo Server‑Side Monitoring System

This article systematically outlines the design, components, data flow, and evolution of Vivo’s server‑side monitoring system, covering data collection, transmission, storage with OpenTSDB, visualization, alerting mechanisms, and comparisons with other monitoring solutions.

Alerting · OpenTSDB · Operations
0 likes · 19 min read
DevOps Cloud Academy
Jul 1, 2022 · Operations

How DevOps Can Help Reduce Technical Debt

This article explains what technical debt is, why it arises, and how adopting DevOps practices such as cross‑functional teams, automation, infrastructure‑as‑code, containerization, and API‑centric design can identify, track, and repay technical debt to improve system reliability and agility.

Containers · DevOps · Infrastructure as Code
0 likes · 9 min read
Model Perspective
Jun 26, 2022 · Fundamentals

What Is Decision Analysis? Core Concepts, Elements, Types, and Steps

This article explains decision analysis by defining its concept, outlining basic elements such as decision makers, goals, alternatives, natural states, outcomes and criteria, classifying decisions into deterministic, risk, competitive and sequential types, and describing the five-step scientific decision‑making process.

Operations · decision analysis · decision making
0 likes · 7 min read
php Courses
Jun 23, 2022 · Operations

BT (BaoTa) Panel Installation and Initialization Guide

This article provides a step‑by‑step tutorial for beginners on installing, initializing, and using the BT (BaoTa) server management panel on Linux or Windows, covering cloud server setup, firewall rules, domain binding, UI access, package selection, and optional software installations.

BT Panel · Installation · Operations
0 likes · 5 min read
Hulu Beijing
Jun 23, 2022 · Operations

How to Optimize Ad Traffic Allocation with Front‑Load Curves and PID Control

This article explains how to prioritize ad orders, use front‑loading to smooth traffic fluctuations, model delivery constraints with differential equations, and apply PID‑based selection coefficients to achieve efficient, real‑time traffic allocation in streaming advertising systems.

Operations · PID control · ad allocation
0 likes · 13 min read
Efficient Ops
Jun 22, 2022 · Operations

Top 13 Essential Linux Ops Tools Every Sysadmin Should Master

This guide introduces thirteen practical Linux operations tools, ranging from network bandwidth monitors such as NetHogs to security scanners such as Nmap, and provides concise descriptions, installation commands, and usage tips to help system administrators manage and secure their servers efficiently.

Operations · Sysadmin · monitoring
0 likes · 12 min read
Inke Technology
Jun 22, 2022 · Operations

How InnoLive Cut Monitoring Costs by 86% with Nightingale

This article details InnoLive's migration from Open‑Falcon to the Nightingale monitoring platform, describing the pain points of their previous system, the selection process, deployment architecture, collection practices, and the substantial cost and performance benefits achieved.

Cost reduction · Open-Falcon · Operations
0 likes · 10 min read
Qunar Tech Salon
Jun 22, 2022 · Operations

Design and Implementation of Multi‑Cluster HPA Metrics Collection, Analysis, and Reporting in Kubernetes

This article explains the background, benefits, and measurement criteria of the Kubernetes Horizontal Pod Autoscaler (HPA), describes the creation of metric tables and SQL queries for collecting scaling events and CPU usage, and presents a Python‑based workflow that aggregates the data, stores daily reports, validates results, and sends automated email summaries.

HPA · Kubernetes · Operations
0 likes · 19 min read