Tagged articles
DevOps Operations Practice
Jul 4, 2024 · Operations

Building an Enterprise‑Level Monitoring System: Requirements, Technology Selection, Architecture, Implementation Steps, and Maintenance

This article provides a comprehensive guide to designing and deploying an enterprise‑grade monitoring system, covering requirement analysis, tool selection such as Prometheus and Zabbix, system architecture, step‑by‑step implementation, alerting, visualization, and ongoing maintenance to ensure reliable IT operations.

Alerting · Grafana · Operations
0 likes · 7 min read
Efficient Ops
Jul 3, 2024 · Operations

How Shanghai Stock Exchange Secured Dual DevOps International and Domestic Certification

The article details China's 2024‑2027 Information Standard Construction Action Plan, the launch of synchronized ITU DevOps international and domestic assessments, and how Shanghai Stock Exchange's website and app platform successfully passed both standards, highlighting the significance for national standardization and operational excellence.

China · DevOps · International Standards
0 likes · 8 min read
JD Cloud Developers
Jul 2, 2024 · Operations

How Large Language Models Are Transforming Modern IT Operations

From manual server management to automated scripts, AIOps, and ChatOps, this article traces the evolution of IT operations and demonstrates how large language models boost efficiency, enable intelligent assistants, automated diagnostics, and smart log analysis, aiming for rapid fault detection, localization, and resolution.

ChatOps · Large Language Models · Operations
0 likes · 7 min read
DevOps Coach
Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Mitigation · Operations · Reliability
0 likes · 23 min read
Efficient Ops
Jun 28, 2024 · Operations

How China’s Agricultural Bank Achieved Dual DevOps Certification and Set a New Industry Benchmark

The Agricultural Bank of China’s Digital Twin Platform passed both the ITU DevOps international standard and the domestic DevOps Level‑3 Continuous Delivery assessment, highlighting China’s push for internationalized information standards and showcasing the broader rollout of synchronized DevOps evaluations across the nation.

China · DevOps · Digital Twin
0 likes · 7 min read
Efficient Ops
Jun 28, 2024 · Operations

How Shandong City Commercial Bank Alliance Earned Leading DevOps Dual‑Certification

The article details China’s 2024‑2027 Information Standard Construction Action Plan, the launch of synchronized ITU DevOps international and domestic assessments, and a case study of Shandong City Commercial Bank Alliance’s successful dual‑certification, highlighting interview insights, performance metrics, and the broader push for standards internationalization.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 14 min read
TAL Education Technology
Jun 27, 2024 · Cloud Native

Case Study: Integrating the AiFenxi BI Platform with Apache APISIX Gateway for Improved Performance and Stability

This case study details how the AiFenxi business intelligence platform integrated Apache APISIX as a high‑performance API gateway within Tencent Cloud TKE, addressing latency, scalability, and security challenges, and outlines the architectural changes, deployment steps, and resulting performance improvements.

APISIX · BI platform · Cloud Native
0 likes · 7 min read
Open Source Linux
Jun 27, 2024 · Operations

Comprehensive Guide to Building a Resilient, High‑Performance Web Infrastructure

This guide outlines essential steps for creating a robust, high‑availability website architecture, covering domain acquisition, DNS management, CDN deployment, image caching, data center selection, monitoring, DDoS mitigation, redundancy, server configuration, database replication, testing environments, security practices, and operational tooling.

Cloud Services · DDoS protection · Operations
0 likes · 12 min read
Sanyou's Java Diary
Jun 24, 2024 · Operations

How Visualized Full‑Chain Log Tracing Transforms Complex Business Systems

This article explains a new visualized full‑chain log tracing solution that organizes business logs by logical flow, dynamically links them during execution, and provides a visual, searchable view of the entire business process, dramatically improving issue localization in large‑scale distributed systems.

Backend · Operations · log tracing
0 likes · 26 min read
Linux Cloud Computing Practice
Jun 24, 2024 · Operations

150 Essential Linux Commands Every Sysadmin Should Master

This comprehensive guide lists 150 indispensable Linux commands covering file management, system monitoring, networking, user administration, process control, and more, providing clear explanations to help both beginners and experienced administrators efficiently manage Linux environments.

Operations · Shell · Unix
0 likes · 25 min read
Software Development Quality
Jun 21, 2024 · Operations

Stabilizing Test Environments with a Trunk‑Based Strategy

This article outlines a comprehensive approach to improve test environment stability by introducing a trunk‑based environment as the default, detailing solution architecture, various testing scenarios, implementation steps, and monitoring practices to transition from unstable daily environments to a more reliable testing ecosystem.

Deployment · Environment · Operations
0 likes · 14 min read
Alipay Experience Technology
Jun 19, 2024 · Backend Development

How Alipay’s “Mantiantianxing” Platform Boosts Development and Operations Efficiency

This article details how Alipay’s Mantiantianxing platform was designed and built to unify page construction, fine‑grained operation, and data feedback, thereby solving low R&D and operation efficiency, reducing duplication, and enabling rapid, scalable innovation across multiple product front‑ends and back‑ends.

Backend · Operations · architecture
0 likes · 22 min read
Software Development Quality
Jun 19, 2024 · Operations

Best Practices for Test Data Management and Usage

This guide outlines comprehensive principles for generating, using, and cleaning test data across development, performance, and production environments, emphasizing independence, realism, security, proper permission controls, and systematic synchronization to ensure reliable and safe testing processes.

Data Management · Operations · Software Testing
0 likes · 6 min read
JD Tech Talk
Jun 17, 2024 · Cloud Computing

Cost Governance for Enterprise IT in the Cloud Era

This article examines how cloud computing has become central to enterprise IT architecture, explores its cost governance challenges, outlines industry trends, standards like ITIL and COBIT, and presents practical strategies—including FinOps, multi‑cloud platforms, and sustainable practices—to effectively manage and reduce IT costs.

FinOps · IT cost governance · Operations
0 likes · 54 min read
High Availability Architecture
Jun 14, 2024 · Operations

Evolution and Practice of Vivo CICD Artifact Management in DevOps

This article details the evolution of Vivo's CICD artifact management across four stages, explains its core functions such as multi‑type support, unified storage, promotion, security scanning, aging, and permission control, and outlines future directions toward smarter, more integrated, and secure DevOps operations.

Artifact Management · CICD · Continuous Delivery
0 likes · 16 min read
Practical DevOps Architecture
Jun 13, 2024 · Operations

Comprehensive Data Center Operations Training Course Overview

This extensive training program covers everything a data center operations engineer needs—from foundational infrastructure management and server hardware maintenance to advanced network configuration, security hardening, monitoring, fault handling, and practical hands‑on skills for real‑world challenges.

Infrastructure · Operations · data center
0 likes · 6 min read
Qunar Tech Salon
Jun 12, 2024 · Artificial Intelligence

Design and Implementation of Qunar Flight Ticket Intelligent Alert (Radar) System

This article presents a comprehensive analysis and engineering of Qunar's flight‑ticket intelligent pre‑warning (Radar) system, covering the business need, value analysis, architectural redesign, feature extraction, indicator classification, accuracy quantification, multi‑algorithm anomaly detection, automatic parameter tuning, observed effects, and future plans to incorporate large‑model techniques.

Operations · anomaly detection · flight ticket
0 likes · 17 min read
Efficient Ops
Jun 10, 2024 · Operations

Why IPv4 Is Running Out and How Companies Can Navigate the Costly IPv6 Migration

With IPv4 address space exhausted and providers beginning to charge for public IPv4 usage, organizations face rising costs and complex migration challenges, prompting a strategic shift toward IPv6 adoption, alternative solutions, or passing expenses to customers, while grappling with ISP support gaps and tooling limitations.

IPv4 · IPv6 · Network Migration
0 likes · 13 min read
Open Source Tech Hub
Jun 8, 2024 · Operations

Docker Hub Mirror Service Stopped – Find Fast Alternative Registries

The Shanghai Jiao Tong University mirror announced the shutdown of its Docker Hub accelerator on June 6, prompting users to replace the unavailable address with other domestic mirrors such as NetEase, Alibaba Cloud, Baidu, and Nanjing University to maintain fast container image downloads.

Container Registry · Mirror · Operations
0 likes · 3 min read
Python Programming Learning Circle
Jun 3, 2024 · Operations

Using Python for Operations Automation: Remote Execution, Log Parsing, Monitoring, Deployment, and Backup

This article demonstrates how Python can automate common operations tasks such as remote command execution, log file parsing, system monitoring with alerts, batch software deployment, and file backup and recovery, providing code examples using libraries like paramiko, regex, psutil, fabric, and shutil.

DevOps · Operations · Python
0 likes · 5 min read
MaGe Linux Operations
May 31, 2024 · Operations

Mastering journalctl: Powerful Techniques to Query systemd Logs

This guide explains how to use the journalctl command to view, filter, and manage systemd-journald logs on Linux, covering help options, match expressions, persistent storage, disk usage, vacuuming, time ranges, unit filtering, priority levels, real‑time streaming, output formatting, and kernel log access.

Operations · Systemd · journalctl
0 likes · 13 min read
Liangxu Linux
May 30, 2024 · Operations

Why Do Most Servers Run Linux? Historical and Technical Reasons Explained

This article compiles several Zhihu answers that trace the historical shift from Windows/IIS to Linux-based servers, highlighting ecosystem dynamics, cost advantages, performance differences, container support, and open‑source adoption that together explain why Linux dominates modern server environments.

IIS · Linux · Operations
0 likes · 10 min read
DevOps Engineer
May 29, 2024 · Cloud Computing

Overview of the Python Software Foundation (PSF) Infrastructure

The article provides a comprehensive overview of the Python Software Foundation's infrastructure, detailing its team, cloud providers, data centers, and the hosting arrangements for numerous services such as PyPI, docs, bug trackers, and the main Python website.

Cloud Services · Infrastructure · Operations
0 likes · 9 min read
Mike Chen's Internet Architecture
May 26, 2024 · Operations

9 Essential Metrics for Effective Microservice Monitoring

This article outlines nine crucial microservice monitoring indicators—including request tracing, health checks, throughput, response time, success and error rates, concurrent connections, CPU/memory usage, and resource utilization—to help engineers assess performance and reliability in distributed systems.

Operations · microservice monitoring · performance metrics
0 likes · 8 min read
21CTO
May 23, 2024 · Operations

What a Solo Founder Learned from Scaling TinyPilot to $800K Revenue

The author recounts five years of building TinyPilot, detailing revenue growth, product launches, team expansion, operational challenges, cost management, and personal reflections on entrepreneurship, while sharing lessons learned and goals for the coming year.

Entrepreneurship · Hardware · Operations
0 likes · 15 min read
Architect's Tech Stack
May 18, 2024 · Operations

Graceful Shutdown in Kubernetes and Spring Boot Microservices: Best Practices and Optimizations

This article explains the concept of graceful shutdown, outlines essential steps, examines Kubernetes pod termination and Spring Boot integration with Nacos, and provides practical optimizations—including PreStop hooks, terminationGracePeriod settings, and actuator shutdown—to ensure reliable service termination without data loss.

Cloud Native · Graceful Shutdown · Kubernetes
0 likes · 11 min read
Mike Chen's Internet Architecture
May 18, 2024 · Operations

Mastering Gray Release: Safely Deploy Updates in Large‑Scale Systems

This article explains the concept of gray (canary) release, why it’s essential for large‑scale architectures, outlines the step‑by‑step workflow, describes common traffic‑splitting strategies, and offers practical tips for monitoring and gradually scaling deployments to ensure system stability.

Deployment Strategy · Operations · canary deployment
0 likes · 4 min read
Cognitive Technology Team
May 16, 2024 · Operations

Guide to Building Stability in Distributed Systems

This guide presents comprehensive principles, best practices, and techniques for designing, deploying, and maintaining stable distributed systems, covering fault tolerance, monitoring, capacity planning, incident response, and operational reliability to help engineers achieve high availability.

Distributed Systems · Operations · reliability engineering
0 likes · 1 min read
Cognitive Technology Team
May 16, 2024 · Operations

Core Principles of High‑Availability Architecture Design

These core principles—minimal dependency, weak dependency, distribution, rate limiting, degradable design, balanced risk, fault prevention and isolation, no single point of failure, self‑protection, automatic failover, and retry/idempotency/compensation—guide the design of highly available systems by reducing risk, ensuring redundancy, and protecting services at all layers.

Operations · Reliability · System Design
0 likes · 3 min read
Efficient Ops
May 14, 2024 · Operations

China’s Top Banks Lead DevOps Maturity: Insights from CAICT’s Model

China’s banks are rapidly adopting DevOps, with CAICT’s maturity model showing China Merchants Bank topping assessments across multiple years, highlighting how standardized DevOps practices boost IT efficiency, product delivery speed, and business satisfaction in the era of digital transformation.

DevOps · Digital Transformation · Maturity Model
0 likes · 9 min read
DataFunTalk
May 14, 2024 · Cloud Computing

Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio

This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.

AI storage · Alluxio · Big Data
0 likes · 14 min read
Programmer DD
May 14, 2024 · Operations

Mastering Full‑Link Load Testing: The Ultimate Guide to Capacity Assurance

This article explains the concept, challenges, step‑by‑step process, organizational and tool requirements, capacity governance, planning, and AI‑driven prediction for full‑link load testing, illustrating how enterprises can ensure system capacity and stability during large‑scale online events.

Operations · Performance Testing · capacity assurance
0 likes · 9 min read
dbaplus Community
May 13, 2024 · Cloud Native

Do You Really Need Kubernetes? Real‑World Dev Opinions and Practical Tips

This article compiles diverse Zhihu answers discussing whether Kubernetes is necessary, weighing its automation benefits and scaling power against configuration complexity, resource costs, and team readiness, while offering concrete kubectl commands and guidance for making an informed adoption decision.

Cloud Native · Kubernetes · Operations
0 likes · 9 min read
Qunar Tech Salon
May 13, 2024 · Operations

Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks

This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service, revealing a weekly pattern linked to RAID controller consistency checks that cause IO spikes, logback queue blockage, and ultimately Dubbo client‑side timeouts, and proposes mitigation steps and general performance‑troubleshooting guidelines.

Operations · RAID · logback
0 likes · 22 min read
Open Source Linux
May 13, 2024 · Information Security

What Is a Bastion Host and Why It’s Critical for Secure Operations

This article explains what a bastion host (jump server) is, why it evolved from traditional jump servers, its core 4A design (authentication, authorization, account, audit), deployment options, common features, authentication methods, and how open‑source and commercial solutions differ, helping organizations improve security and compliance.

Authentication · Bastion Host · Operations
0 likes · 10 min read
Efficient Ops
May 12, 2024 · Operations

What Is the New DevOps International Standard and How Does It Shape Cloud Service Development?

The article outlines the DevOps International Standard (ITU‑T Y.3525), its development history, publication, evaluation scheme upgrades, relationship with China’s domestic DevOps standards, and provides a comprehensive overview of industry participation in the DevOps capability maturity model as of April 2024.

DevOps · International Standard · Maturity Model
0 likes · 9 min read
Efficient Ops
May 12, 2024 · Operations

How China’s Agricultural Bank Leads DevOps Maturity Across Multiple Projects

The article details how the Agricultural Bank of China has leveraged the CAICT DevOps Capability Maturity Model to achieve extensive assessments across dozens of projects, illustrating the bank’s digital transformation, security improvements, and operational benefits within the broader national push toward intelligent, networked enterprises.

DevOps · Digital Transformation · Maturity Model
0 likes · 13 min read
ITPUB
May 10, 2024 · Databases

Choosing Low‑Risk Strategies for Critical DBA Outages

When a major operations incident strikes, the safest approach is to prioritize simple, low‑risk actions and accept limited responsibility, as illustrated by real DBA lessons from Oracle RAC failures and a data‑center power‑loss disaster.

DBA · Operations · Oracle RAC
0 likes · 7 min read
Software Development Quality
May 10, 2024 · Operations

Mastering Software Deployment: From Development to Production Environments

This guide explains the purpose and characteristics of development, test, pre‑release, gray‑scale, and production environments, outlines deployment methods, key considerations, phased strategies, environment differences, testing data construction, and synchronization practices to improve software development quality and efficiency.

Deployment · DevOps · Environment
0 likes · 6 min read
Efficient Ops
May 9, 2024 · Operations

Understanding the New ITU‑T Y.3525 DevOps Standard: Implications for Cloud Operations

The ITU‑T Y.3525 DevOps international standard, aligned with China’s domestic DevOps evaluation, was announced on April 25, 2024, detailing its development history, evaluation scheme, upgraded certification, and the relationship with domestic standards, while showcasing industry participation data and contact information for assessments.

DevOps · International Standard · Maturity Model
0 likes · 12 min read
Efficient Ops
May 9, 2024 · Operations

How China’s Agricultural Bank Leads the Way in DevOps Maturity

Amid China's digital transformation wave, the Agricultural Bank of China has leveraged the CAICT‑led DevOps Capability Maturity Model to achieve record‑setting assessments across multiple projects, demonstrating how systematic DevOps adoption can boost security, efficiency, and innovation in large financial institutions.

DevOps · Digital Transformation · Maturity Model
0 likes · 14 min read
Efficient Ops
May 6, 2024 · Cloud Native

Why Is My Kubernetes Pod OOMKilled Before Reaching Its Memory Limit?

A Kubernetes pod repeatedly restarted with exit code 137 despite not hitting its memory limit, revealing that node‑level memory pressure and QoS‑based eviction caused the pod to be killed, and outlining how to diagnose and prevent such OOMKill events.

Cloud Native · Kubernetes · OOMKill
0 likes · 9 min read
Architects Research Society
May 3, 2024 · Operations

Digital Transformation Framework for Asset Management: Business, Technology, and Cultural Perspectives

The article presents a comprehensive Digital Transformation Framework (DTF) for asset‑management firms, detailing how financial, economic, risk, and technical dimensions intertwine, illustrating the composable‑enterprise model, middle‑office evolution, API‑centric automation, and the essential cultural shift required for successful digital change.

API · Business strategy · Digital Transformation
0 likes · 17 min read
Bilibili Tech
Apr 30, 2024 · Industry Insights

How Bilibili’s Smart Cabling Platform Boosts Data Center Efficiency

This article examines Bilibili's data‑center cabling challenges and presents a smart management platform that digitizes design, automates routing with scenario‑based and shortest‑path algorithms, streamlines task creation and operation, ultimately reducing installation time and improving maintenance efficiency.

Cabling · Infrastructure · Management
0 likes · 12 min read
Efficient Ops
Apr 29, 2024 · Operations

Accelerate Linux Ops: Fast Deletion, iSCSI Detection & Quick Group Management

This guide presents practical Linux and vSphere techniques—including using rsync for rapid bulk file deletion, detecting newly added iSCSI disks without reboot, safeguarding rm commands, mounting remote filesystems with SSHFS, and quickly adding users to supplementary groups via gpasswd—to boost operational efficiency.

Linux · Operations · gpasswd
0 likes · 10 min read
Efficient Ops
Apr 29, 2024 · Operations

How Minsheng Bank Reached Level‑3 DevOps Continuous Delivery – A Leading‑Edge Case Study

This article details Minsheng Bank’s successful achievement of Level‑3 Continuous Delivery in the CAICT DevOps maturity model, highlighting the assessment process, interview insights from senior tech leaders, measurable efficiency gains, and the broader significance of standardized DevOps practices for enterprise digital transformation.

Continuous Delivery · DevOps · Maturity Model
0 likes · 13 min read
JD Retail Technology
Apr 26, 2024 · Operations

How Isolation Principles Boost System High Availability: Real-World Cases

This article explains the concept of high availability, defines the isolation principle, outlines its implementation across various layers, and presents concrete case studies—including vertical data‑center redesign, dual‑cluster Elasticsearch migration, traffic grouping, and hot‑cold data segregation—to illustrate how isolation improves system resilience.

Backend · Operations · System Design
0 likes · 15 min read
Efficient Ops
Apr 25, 2024 · Operations

How China Southern Airlines Accelerated Delivery with DevOps: A Real‑World Case Study

China Southern Airlines' BTRIP and POMS projects passed the CAICT DevOps Continuous Delivery Standard, showcasing how standardized DevOps practices, agile development, CI/CD pipelines, and targeted team initiatives dramatically improved software quality, delivery speed, and operational efficiency in the airline industry.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 12 min read
21CTO
Apr 24, 2024 · Operations

Why Are IT Teams Still Growing in 2023? Insights from the Linux Foundation Survey

According to a recent Linux Foundation survey of 418 IT hiring professionals, 37% of companies expanded their technical staff in 2023 while 34% maintained headcount. Cloud providers led hiring growth, while training cuts, longer onboarding times, and rising turnover concerns are shaping the evolving tech talent landscape.

IT hiring · Linux Foundation · Operations
0 likes · 6 min read
Liangxu Linux
Apr 19, 2024 · Operations

Step-by-Step Guide to Diagnose High CPU Usage on Linux

This guide walks you through checking CPU usage, system load, resource consumption, problematic processes, system logs, and performance bottlenecks on a Linux server using common command‑line tools such as top, uptime, pidstat, strace, tail, and perf.

CPU · Linux · Operations
0 likes · 3 min read
DeWu Technology
Apr 19, 2024 · Operations

How to Safeguard B‑End Link Configurations: System Limits, Front‑End Checks, and Automated Alerts

This article analyzes the risks of incorrect B‑end link configurations in fast‑moving business environments and presents a comprehensive protection framework—including system‑level design constraints, front‑end inspections, log‑based alerts, and UI automation—to ensure link accuracy, stability, and rapid issue resolution.

Operations · automation · frontend inspection
0 likes · 8 min read
Cognitive Technology Team
Apr 15, 2024 · Operations

Tencent Cloud Service Outage on April 8: Root Cause, Impact, and Improvement Measures

On April 8, Tencent Cloud experienced a major service outage caused by a cloud API failure that prevented console login and disrupted several public cloud services for 87 minutes, prompting a detailed post‑mortem that outlines the root cause, impact, and a series of operational and change‑management improvements.

Operations · Tencent Cloud · change management
0 likes · 4 min read
Test Development Learning Exchange
Apr 12, 2024 · Operations

Python Data Backup Scripts and Tools Overview

This article introduces various Python-based data backup techniques, covering standard library modules such as shutil, zipfile, and tarfile, as well as database dump tools like pg_dump and mysqldump, and cloud storage options using awscli or boto3, with example code snippets for each method.

Operations · Python · Scripting
0 likes · 4 min read
Architect Chen
Apr 10, 2024 · Operations

Mastering Load Balancing: Algorithms, Nginx Setup, and Real‑World Use Cases

This article explains load balancing fundamentals, shows how to configure Nginx for a Tomcat server pool, compares common balancing algorithms, describes OSI‑layer classifications, and outlines typical scenarios such as web farms, application clusters, databases, CDN, and cloud environments.

Algorithms · Backend · Operations
0 likes · 8 min read
Architect Chen
Apr 9, 2024 · Backend Development

How to Warm Up Distributed Caches for High‑Concurrency Systems

This article explains what cache pre‑warming is, why it is essential for high‑traffic applications, and compares three practical approaches—scheduled tasks, batch loading, and manual trigger APIs—highlighting their advantages, drawbacks, and typical usage scenarios.

Backend · Cache · Operations
0 likes · 6 min read
Efficient Ops
Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work (Infrastructure, Platform, and Business), covers hiring challenges and daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, and user support, and offers practical interview and career advice.

Oncall · Operations · SRE
0 likes · 22 min read
AntTech
Apr 3, 2024 · Artificial Intelligence

Post‑Mortem of an AI‑Generated Flash‑Sale System Failure at Ant Internal Network

The article analyzes a recent outage of Ant's internal flash‑sale service built with AI‑generated low‑code, explains why the AI‑written business logic was not the cause, details the database capacity bottleneck that triggered a snowball effect, and discusses future automation and operational strategies to prevent similar failures.

AI · Operations · database scaling
0 likes · 12 min read
Efficient Ops
Apr 2, 2024 · Operations

How Chinese Exchanges Achieve DevOps Maturity: Insights from the 2023 CAICT Survey

The 2023 China DevOps Survey reveals that major securities and futures exchanges have significantly advanced their DevOps maturity, with many achieving level‑3 continuous delivery assessments, reduced delivery cycles, higher test coverage, and faster build times, illustrating the impact of digital transformation on IT efficiency.

China · DevOps · Digital Transformation
0 likes · 9 min read
Efficient Ops
Apr 1, 2024 · Operations

How Leading Chinese Insurers Accelerate Digital Transformation with DevOps Maturity

This article reviews the 2023‑2024 DevOps capability maturity assessments of six major Chinese insurance firms, highlighting their certified projects, the continuous delivery standards achieved, and how adopting the CAICT DevOps model drives faster, higher‑quality software delivery and digital transformation across the industry.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 10 min read
Efficient Ops
Mar 31, 2024 · Operations

How Chinese Banks Accelerate Digital Transformation with DevOps Maturity

The 2023 China DevOps Survey reveals that Chinese banks are increasingly adopting DevOps to boost IT efficiency, with over 100 enterprises completing 351 assessments across multiple standards, showcasing concrete case studies of mobile banking, cloud platforms, and security initiatives that illustrate the practical impact of the CAICT DevOps maturity model.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 17 min read
FunTester
Mar 29, 2024 · Operations

Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes

This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.

Fault Injection · Operations · WeChat Pay
0 likes · 18 min read
Efficient Ops
Mar 28, 2024 · Operations

How Chinese Banks Are Accelerating DevOps Maturity in 2024

The 2024 report details how major Chinese banks have adopted the CAICT DevOps Capability Maturity Model, showcasing assessment results across multiple standards—continuous delivery, technical operations, security, system tools, BizDevOps and efficiency measurement—highlighting improvements in agility, automation and overall IT efficiency.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 19 min read
Architecture Digest
Mar 28, 2024 · Operations

A Comprehensive Overview of Monitoring Systems: Fundamentals, Popular Open‑Source Solutions, and Selection Guidance

This article systematically introduces monitoring fundamentals, core concepts, and architecture, then reviews three widely used open‑source monitoring tools—Zabbix, Open‑Falcon, and Prometheus—detailing their components, advantages, disadvantages, and provides practical advice for selecting the most suitable solution.

Open-Falcon · Operations · Zabbix
0 likes · 17 min read
Efficient Ops
Mar 27, 2024 · Operations

How Chinese State Banks Accelerate Digital Transformation with DevOps Maturity

The 2023 China DevOps Status Survey reveals that state-owned banks have steadily improved their DevOps maturity, with over 100 enterprises evaluated across 20+ industries. The article presents detailed case studies and assessment results for the agile development, continuous delivery, security, and performance standards, illustrating the impact of DevOps on banking digital transformation.

China · DevOps · Digital Transformation
0 likes · 24 min read
Java Architect Essentials
Mar 27, 2024 · Cloud Native

Mastering Graceful Shutdown in Kubernetes with Spring Boot and Nacos

This article explains the concept of graceful shutdown, walks through a Kubernetes pod termination flow, demonstrates a Spring Boot + Nacos example with PreStop hooks, identifies common pitfalls, and provides practical optimizations—including MQ handling, scheduled tasks, traffic control, and actuator shutdown—to achieve reliable, zero‑downtime service termination.

Cloud Native · Graceful Shutdown · Kubernetes
0 likes · 12 min read
Efficient Ops
Mar 25, 2024 · Operations

Why SRE Exists and How It Solves Modern Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how SRE teams use SLOs, monitoring, and scenario drills to improve system reliability, performance, and observability in complex production environments.

DevOps · Operations · Reliability
0 likes · 12 min read
Efficient Ops
Mar 24, 2024 · Operations

20 Essential Linux Terminal Tricks to Supercharge Your Productivity

This article compiles a set of practical Linux command‑line shortcuts—from tab completion and directory navigation to history search and log monitoring—that help both beginners and seasoned users work faster, avoid common pitfalls, and boost overall terminal productivity.

Linux · Operations · Shell
0 likes · 13 min read
Alipay Experience Technology
Mar 19, 2024 · Big Data

How Alipay Cut Merchant Bill Complexity by 60% Using a Five‑Step Method

This article details how Alipay's data engineering team applied Elon Musk's five‑step work method to completely refactor a decade‑old merchant billing system, reducing overall complexity by over 60%, improving timeliness by an hour, cutting storage and compute costs by a third, and dramatically lowering operational and maintenance burdens.

Big Data · Cost reduction · Operations
0 likes · 23 min read
Architect's Guide
Mar 19, 2024 · Cloud Native

Graceful Shutdown in Kubernetes with Spring Boot and Nacos: Concepts, Cases, and Optimizations

This article explains the concept of graceful shutdown, demonstrates it with Kubernetes‑SpringBoot‑Nacos case studies, analyzes common issues, and provides optimization strategies such as adjusting terminationGracePeriodSeconds, using PreStop hooks, handling MQ and scheduled tasks, and leveraging actuator shutdown for reliable service termination.

Cloud Native · Graceful Shutdown · Nacos
0 likes · 10 min read
Baidu Geek Talk
Mar 18, 2024 · Industry Insights

How Baidu Ensures Transaction Data Consistency with Real‑Time and Offline Reconciliation

This article examines Baidu's transaction middleware, detailing its multi‑layer architecture, the challenges of high‑volume, multi‑scenario payments, and the design of both near‑real‑time and T+1 offline reconciliation systems that leverage binlog listening, ETL pipelines, and big‑data technologies to guarantee data consistency across dozens of internal and external services.

Baidu · Data Reconciliation · Offline
0 likes · 15 min read