Tagged articles
3281 articles
FunTester
Dec 10, 2023 · Databases

How GitHub Upgraded 1,200 MySQL Servers to 8.0 Without Downtime

GitHub detailed a year‑long, multi‑team effort to upgrade over 1,200 MySQL hosts from 5.7 to 8.0 using phased rollouts, automated testing, compatibility checks, and rollback mechanisms while maintaining strict SLOs and high‑availability requirements.

GitHub · Operations · database migration
0 likes · 16 min read
DataFunTalk
Dec 10, 2023 · Operations

Designing Experiments for Peak Surge Pricing in Two‑Sided Markets: Lessons from Uber, Lyft, DoorDash and Didi

This article examines how two‑sided platforms such as Uber, Lyft, DoorDash and Didi design and evaluate peak‑surcharge experiments, addressing network effects, bias‑variance trade‑offs, time‑space slicing, random‑saturation designs, and continuous bandit‑based testing within an operations‑focused experimental system.

AB testing · Operations · causal inference
0 likes · 16 min read
Rare Earth Juejin Tech Community
Dec 10, 2023 · Operations

Comprehensive Guide to Nginx: Architecture, Configuration, and Advanced Features

This extensive tutorial explains Nginx's architecture, installation, directory layout, configuration directives, location matching rules, reverse proxy setup, load balancing strategies, static‑dynamic separation, CORS handling, caching mechanisms, access control lists, rate limiting, HTTPS configuration, compression, and many other essential directives for effective web server and reverse‑proxy management.

Configuration · Nginx · Operations
0 likes · 66 min read
DeWu Technology
Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on consumer‑facing (C‑end) link reliability, Alibaba’s system stability operations, Tencent Games’ cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE
0 likes · 7 min read
Su San Talks Tech
Dec 6, 2023 · Operations

What Went Wrong in Didi’s 12‑Hour Outage? Lessons on Kubernetes Upgrades and Cost‑Cutting

An in‑depth review of Didi’s 12‑hour P0 outage reveals how a mistaken Kubernetes version downgrade during an in‑place upgrade caused master node failure, discusses cluster isolation, upgrade strategies, and the role of cost‑cutting pressures, offering practical lessons for large‑scale operations.

Cluster Upgrade · Cost Management · Kubernetes
0 likes · 7 min read
Architecture and Beyond
Dec 2, 2023 · Operations

Postmortem Analysis of the Yuque Service Outage and Lessons on Complex Systems and the KISS Principle

The article reviews the October 23 Yuque service outage, analyzes root causes such as a buggy upgrade tool and outdated storage, extracts operational lessons on testing, disaster recovery, high‑availability, communication, and advocates the KISS principle to simplify complex systems for improved reliability.

Complex Systems · KISS principle · Operations
0 likes · 10 min read
Open Source Linux
Dec 1, 2023 · Operations

10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help automate, monitor, and manage infrastructure efficiently.

Operations · automation · devops tools
0 likes · 8 min read
DevOps
Nov 30, 2023 · R&D Management

Comprehensive R&D Efficiency Metrics and Calculation Formulas

This article presents a comprehensive collection of R&D efficiency metrics and their calculation formulas, covering code integration, quality, productivity, reliability, maintainability, and deployment aspects, to help teams evaluate and improve development performance and operational effectiveness.

Operations · R&D metrics · code quality
0 likes · 12 min read
Java Captain
Nov 30, 2023 · Operations

Analysis of Didi's November 2023 System Outage and Potential Technical Causes

The article reviews Didi's late‑November 2023 service disruption, detailing the timeline of failures, official apologies, and expert analyses of six possible technical causes—including software bugs, server issues, third‑party failures, DDoS, other attacks, and ransomware—while highlighting the role of a Kubernetes upgrade and cost‑cutting pressures.

Cloud Native · Didi · Operations
0 likes · 7 min read
Alibaba Cloud Developer
Nov 30, 2023 · Backend Development

How Alibaba Travel Billing System Achieves 100% Accuracy and Real‑Time Reconciliation

This article details the design, challenges, and monitoring strategies of Alibaba's travel billing system, explaining how a modular backend architecture, multi‑way reconciliation, full‑link monitoring, and a configurable expression engine enable near‑perfect bill accuracy and automated settlement for enterprise customers.

Backend · Operations · Reconciliation
0 likes · 17 min read
FunTester
Nov 28, 2023 · Operations

How to Adopt a DevOps Culture: Custom Strategies, CI/CD, Automation & Metrics

This article outlines the essential steps for embracing DevOps culture, emphasizing tailored strategies, deep understanding of CI/CD, clear role assignments, extensive automation, key performance metrics, and the critical role of quality assurance to achieve faster, reliable software delivery.

Culture · DevOps · Metrics
0 likes · 9 min read
Bilibili Tech
Nov 28, 2023 · Operations

Technical Assurance Practices for the 13th League of Legends World Championship Live Stream

For the 13th League of Legends World Championship live stream on Bilibili, a comprehensive technical‑assurance framework—covering pre‑event traffic buildup, in‑event experience, and post‑event replay—mapped over 60 business functions, applied a traffic‑estimation model, executed fault‑injection drills, load tests, strict SOPs and change control, and real‑time monitoring, enabling 120 million viewers and a peak of 460 million concurrent users.

Fault Injection · Operations · Performance Testing
0 likes · 19 min read
Efficient Ops
Nov 27, 2023 · Operations

How 19 Leading Chinese Enterprises Accelerated IT Efficiency with the DevOps Maturity Model

This article reviews how nineteen top Chinese companies applied the CAICT‑led DevOps Capability Maturity Model, detailing their assessment results, project improvements, and concrete performance gains such as higher release frequency, full test coverage, and streamlined operations across diverse industry sectors.

Capability Maturity Model · Case Studies · DevOps
0 likes · 10 min read
Efficient Ops
Nov 27, 2023 · Operations

How 5 Leading Insurers Accelerated Digital Transformation with DevOps Maturity Assessments

This article reviews how five leading Chinese insurance firms evaluated eight projects using the CAICT DevOps Capability Maturity Model, highlighting each company's implementation details, performance improvements, and the broader significance of the model for digital transformation and operational excellence in the insurance sector.

Capability Maturity Model · DevOps · Digital Transformation
0 likes · 10 min read
Efficient Ops
Nov 26, 2023 · Operations

Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Operations · SRE · automation
0 likes · 7 min read
Architecture and Beyond
Nov 25, 2023 · Operations

Effective Log Management Strategy: Standards, SDK Integration, and Lifecycle Practices

The article outlines common logging problems and presents a comprehensive six‑step strategy—including clear logging standards, systematic standard management, a unified SDK, centralized log management systems, regular standard reviews, and lifecycle deprecation—to transform chaotic logs into a reliable tool that boosts development efficiency.

Log Management · Operations · SDK
0 likes · 7 min read
Cloud Native Technology Community
Nov 24, 2023 · Operations

Netflix’s Unique Developer Productivity Platform and Platform Engineering Practices

The article examines Netflix’s platform engineering approach, detailing its centralized team structure, hub‑and‑spoke model, internal customer‑support system, productivity evaluation methods, challenges such as documentation, and ongoing efforts to improve developer experience and platform adoption.

Internal Support · Netflix · Operations
0 likes · 10 min read
dbaplus Community
Nov 23, 2023 · Operations

How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples

This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and accuracy, details rule‑based, blacklist/whitelist, ratio‑based, and intelligent noise‑reduction techniques, shares Java code examples, and shows measurable results after applying the governance process.

Alert Noise Reduction · Operations · incident management
0 likes · 13 min read
Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform—introducing second‑level metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool—to dramatically reduce order‑related fault discovery time from minutes to under one minute.

Microservices · Operations · TSDB
0 likes · 22 min read
Data Thinking Notes
Nov 21, 2023 · Operations

36 Essential Data Analysis Models Across 6 Business Domains

This article presents 36 concise data analysis models spanning six key business dimensions—Internet operations, strategy and organization, quality and production, marketing services, financial management, and human resources—to help analysts choose the right method for structured, logical, and effective insights.

Business Analytics · Marketing · Operations
0 likes · 12 min read
Senior Tony
Nov 21, 2023 · Operations

How to Shrink Failure Scope with Circuit Breakers, Degradation, and Link Splitting

This article explains how to reduce the impact of failures in distributed systems by simplifying service links, applying circuit‑breaker mechanisms, implementing graceful degradation, performing core‑link isolation, and, as a last resort, switching to a minimal MVP version to keep essential functionality alive.

Operations · circuit breaker · degradation
0 likes · 11 min read
Architects Research Society
Nov 21, 2023 · Operations

Digital Transformation Guide: Definition, Pricing, and Planning (Part 1) – Key Points and Framework for Asset Management

This article defines digital transformation, introduces the Digital Transformation Framework (DTF) and its economic, risk and financial dimensions, and explains how asset‑management firms can redesign front, middle and back‑office functions, adopt composable enterprise models, and align culture, automation and API‑driven strategies to achieve sustainable, disruptive change.

Business strategy · DTF · Digital Transformation
0 likes · 21 min read
ITPUB
Nov 17, 2023 · Operations

How Bilibili Overcame a Massive CDN Outage: Cloud‑Edge Incident Response Lessons

This article details the August 2023 Bilibili CDN failure, analyzes its root causes, describes the 1‑5‑10 emergency recovery framework, and presents cloud‑side SLB/BFS optimizations and edge‑side scheduling and fallback strategies that together restored service and improved future resilience.

CDN · Edge Computing · Operations
0 likes · 20 min read
JD Tech
Nov 16, 2023 · Operations

Preparing JD's CDP Platform for Double 11: Challenges, Capacity Planning, and Lessons Learned

This article recounts the author's experience preparing JD's Customer Data Platform (CDP) for the Double 11 shopping festival, detailing the platform's capabilities, business scenarios, capacity planning, stability and performance challenges, disaster‑recovery measures, and personal reflections on the intensive technical effort involved.

Big Data · CDP · Operations
0 likes · 12 min read
Efficient Ops
Nov 14, 2023 · Operations

How China’s Top State Banks Accelerate Digital Transformation with DevOps Maturity Models

This article examines how six major Chinese state-owned banks leveraged the CAICT‑led DevOps Capability Maturity Model to assess dozens of projects, improve IT efficiency, integrate resources, and achieve measurable gains in delivery speed, quality, and security across agile development, continuous delivery, operations, and risk management.

Banking · Continuous Delivery · DevOps
0 likes · 22 min read
Baidu Geek Talk
Nov 14, 2023 · Industry Insights

How Elastic Cascading Controls Boost Search Engine Compute Efficiency

This article analyzes the rising compute demand in modern deep‑learning‑driven search systems, proposes a micro‑ and macro‑level adaptive compute‑allocation framework, models the optimization problem with cost, time, and feasibility constraints, and details an elastic cascading architecture that dynamically balances resource usage, system state, and traffic value to achieve higher ROI and stability.

AI · Operations · System optimization
0 likes · 14 min read
JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability
0 likes · 26 min read
Efficient Ops
Nov 9, 2023 · Operations

How Everbright Securities Achieved Top‑Tier DevOps Maturity and Boosted Efficiency

Everbright Securities’ Sunshine E‑Office project passed the CAICT DevOps Continuous Delivery Level‑3 assessment, showcasing how standardized DevOps practices and tool empowerment can dramatically improve development efficiency, quality, and security, while driving digital transformation across the financial sector.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 9 min read
Efficient Ops
Nov 7, 2023 · Operations

How CICC Wealth Reached Advanced DevOps Operations Standards

At the 2023 GOPS Global Operations Conference in Shanghai, the China Academy of Information and Communications Technology (CAICT) announced that CICC Wealth's unified access and authentication project passed the DevOps Technical Operations Level‑2 assessment, showcasing how standardized DevOps practices and tool empowerment can dramatically improve quality, efficiency, and market competitiveness in the financial sector.

Continuous Delivery · DevOps · Financial Services
0 likes · 13 min read
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLIs and SLOs (with SLIs chosen via the VALET method) guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
0 likes · 14 min read
Architect's Guide
Nov 6, 2023 · Operations

Comparison of Prometheus and Zabbix Monitoring Tools

This article compares the open‑source monitoring solutions Prometheus and Zabbix, outlining their histories, architectures, data collection methods, scalability, storage models, configuration complexity, community activity, and suitability for different environments such as traditional servers versus cloud‑native container platforms.

Cloud Native · Operations · Prometheus
0 likes · 8 min read
NetEase LeiHuo Testing Center
Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Operations · Performance Testing · monitoring
0 likes · 15 min read
Efficient Ops
Nov 2, 2023 · Operations

What Do China’s Top Banks Reveal About DevOps Maturity? Insights from GOPS 2023

The 21st GOPS Global Operations Conference in Shanghai unveiled the latest DevOps capability maturity assessment results from major Chinese banks, highlighting pioneering evaluations, detailed improvements across configuration, monitoring, and user experience, and introducing the comprehensive DevOps maturity model that guides digital transformation.

Banking · China · DevOps
0 likes · 10 min read
Efficient Ops
Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Operations
0 likes · 6 min read
AntTech
Nov 2, 2023 · Cloud Native

AI and Cloud‑Native Enhancements for Ant Group’s Consumer Credit Technology Platform

The article describes how Ant Group’s consumer credit technology platform leverages AI and cloud‑native architectures to achieve ultra‑fast operations, precise fund verification, large‑scale simulation, and seamless migration for dozens of financial institutions, addressing the massive technical challenges of internet‑scale credit services.

AI · Cloud Native · DevOps
0 likes · 9 min read
Efficient Ops
Nov 1, 2023 · Operations

How China Merchants Securities Achieved Top DevOps Maturity: A Deep Dive

At the 2023 GOPS Global Operations Conference, China Merchants Securities showcased its successful DevOps assessments—earning multiple Level 3 ratings in agile development and continuous delivery—demonstrating how standardized processes, tool empowerment, and a unified digital management system can accelerate digital transformation and boost market competitiveness.

Agile Development · Continuous Delivery · DevOps
0 likes · 14 min read
Efficient Ops
Oct 31, 2023 · Operations

How Zhengzhou Exchange’s Tech Team Earned Top‑Tier DevOps Level‑3 Delivery

In a detailed Q&A, Zhengzhou Yisheng Information Technology shares how its two exchange‑focused platforms achieved the DevOps Continuous Delivery Level‑3 assessment, highlighting process improvements, metric gains, architectural choices, challenges overcome, and future plans for broader digital transformation.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 12 min read
Continuous Delivery 2.0
Oct 31, 2023 · Operations

Understanding Platform Engineering: Goals, Practices, and Differences from Traditional DevOps

The article explains platform engineering as an evolution of DevOps that emphasizes internal developer platforms, clear organizational goals, reduced friction for engineers, and practical, incremental solutions rather than over‑reliance on complex tools, highlighting its rising popularity and distinct approach.

Internal Developer Platform · Operations · software delivery
0 likes · 6 min read
Efficient Ops
Oct 30, 2023 · Operations

How Chinese Financial Firms Are Raising Their DevOps Maturity

The 21st GOPS Global Operations Conference in Shanghai unveiled the latest DevOps capability maturity assessment results, highlighting first‑time evaluations across exchanges, securities and fund companies, detailed improvements in delivery speed, test coverage and automation, and introducing the comprehensive DevOps standards now adopted industry‑wide.

Capability Maturity · Continuous Delivery · DevOps
0 likes · 12 min read
Efficient Ops
Oct 30, 2023 · Operations

How China’s Leading Banks Are Raising the Bar with DevOps Maturity Assessments

The 21st GOPS Global Operations Conference in Shanghai unveiled the latest DevOps capability maturity assessment results, highlighting how major Chinese banks and financial institutions have adopted DevOps standards to improve technology operations, agile development, security, and system tooling across multiple projects.

Banking · Continuous Delivery · DevOps
0 likes · 11 min read
Architecture and Beyond
Oct 29, 2023 · Operations

Postmortem of the October 23 Yuque Service Outage: Lessons on Complex Systems and the KISS Principle

The October 23 Yuque outage, caused by a buggy upgrade tool and outdated storage hardware, highlighted the importance of thorough testing, robust disaster‑recovery, high‑availability architecture, clear communication, continuous learning, and applying the KISS principle to simplify complex systems and improve operational stability.

Complex Systems · KISS principle · Operations
0 likes · 10 min read
Efficient Ops
Oct 27, 2023 · Operations

How Dongguan Securities Achieved Top‑Tier DevOps Maturity: A Detailed Case Study

Dongguan Securities’ ‘Zhangzhengbao Service Platform’ passed the China Academy of Information and Communications Technology (CAICT) DevOps Capability Maturity Model Level‑3 Continuous Delivery assessment, showcasing how standardized DevOps practices, automated pipelines, and micro‑service architecture dramatically improved development efficiency, code quality, and operational agility, positioning the firm as a domestic leader in financial technology.

Continuous Delivery · DevOps · Operations
0 likes · 12 min read
Efficient Ops
Oct 26, 2023 · Operations

How Zhejiang Rural Commercial Bank Reached Level‑3 DevOps Maturity and Boosted Efficiency

Zhejiang Rural Commercial Bank’s two flagship projects passed the Level‑3 Continuous Delivery assessment, showcasing how standardized DevOps practices, tool empowerment, and agile transformation can dramatically improve software quality, delivery speed, and overall competitiveness in the banking sector.

Banking Technology · Continuous Delivery · DevOps
0 likes · 12 min read
Efficient Ops
Oct 26, 2023 · Operations

How Zhengzhou Commodity Exchange’s Tech Team Earned Top‑Tier DevOps Continuous Delivery Certification

This article details the successful DevOps continuous‑delivery Level 3 assessment of Zhengzhou Commodity Exchange’s technology company, sharing interview insights, performance metrics, implementation challenges, and future plans that illustrate how standardized, automated DevOps practices boost delivery speed, quality, and digital transformation.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 13 min read
Architects Research Society
Oct 25, 2023 · Operations

eBay’s Scalability Best Practices: Functional Partitioning, Horizontal Sharding, Avoiding Distributed Transactions, Asynchronous Decoupling, Caching, and Virtualization

The article outlines eBay’s key scalability best practices—including functional partitioning, horizontal sharding, eliminating distributed transactions, aggressive asynchronous decoupling, intelligent caching, and pervasive virtualization—to illustrate how large‑scale web systems can achieve linear or sub‑linear growth while maintaining availability and performance.

Operations · Scalability · asynchronous processing
0 likes · 14 min read
Efficient Ops
Oct 22, 2023 · Operations

Master Loki: Deploy, Configure, and Query Logs Efficiently

This guide explains Loki's core concepts, deployment steps for Promtail and Loki, Grafana integration, label‑based indexing, handling dynamic and high‑cardinality tags, and query optimization techniques, providing a complete roadmap for building a cost‑effective, scalable log aggregation system.

Grafana · Kubernetes · Loki
0 likes · 15 min read

How Transparent AI Boosts Trust in AIOps: Explainable Root‑Cause Solutions

This article examines the rapid growth of the Chinese IT operations market, explains why AIOps faces trust challenges due to opaque deep‑learning models, and presents AsiaInfo's transparent‑model and post‑hoc explanation engine together with three concrete explainable root‑cause analysis methods, concluding with future outlooks for trustworthy AIOps.

AI trust · Operations · Root Cause Analysis
0 likes · 13 min read
Efficient Ops
Oct 15, 2023 · Databases

How to Diagnose and Fix Slow Redis Responses: A Step-by-Step Guide

This article walks through practical methods for troubleshooting slow service alerts, diagnosing Redis performance bottlenecks, and reproducing issues with local demos and load simulations, offering concrete metrics, command‑line checks, and mitigation strategies such as scaling, rate‑limiting, and pipeline optimization.

Operations · monitoring · performance
0 likes · 22 min read
Alibaba Cloud Native
Oct 10, 2023 · Operations

Mastering Memcached: Features, Use Cases, and Prometheus Monitoring

This article explains Memcached’s architecture, key characteristics, suitable and unsuitable scenarios, memory management and LRU mechanisms, version details, and provides a comprehensive guide to monitoring its performance and health using Prometheus and Alibaba Cloud ARMS dashboards.

Cloud Native · Memcached · Operations
0 likes · 26 min read
Efficient Ops
Oct 9, 2023 · Cloud Native

Why Do Kubernetes Pods Get Stuck? Decoding Common Pod Status Errors

Learn how to diagnose and resolve frequent Kubernetes pod status issues such as ContainerCreating, ErrImagePull, Pending, CrashLoopBackOff, and UnexpectedAdmissionError by examining Docker services, storage mounts, ConfigMaps, image repositories, and node resources, with practical examples and command‑line solutions.

ConfigMap · ContainerCreating · ErrImagePull
0 likes · 9 min read
FunTester
Oct 8, 2023 · Operations

Why Writing Test Cases Saves Time and Boosts Software Quality

This article explains how systematic test case creation, clear verification steps, and disciplined bug tracking improve testing efficiency, ensure comprehensive coverage of core system logic and data flows, and help teams identify root causes to continuously raise software quality despite limited resources.

Operations · Software Testing · bug tracking
0 likes · 10 min read
dbaplus Community
Oct 7, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down high‑availability system design into six critical layers—architecture, development standards, application services, storage, product safeguards, and operations—offering concrete practices such as capacity planning, fault‑tolerant patterns, monitoring, and incident‑response strategies to achieve four‑nines (99.99%) uptime.

Operations · System Design · capacity planning
0 likes · 26 min read
Efficient Ops
Oct 7, 2023 · Operations

How ICBC Fund’s DevOps‑Driven Trading Platform Powers Digital Transformation

The 2023 China International Service Trade Fair’s Enterprise Digital Transformation Forum highlighted ICBC Fund’s award‑winning DevOps‑based trading management platform, showcasing how a high‑cohesion, low‑coupling architecture and dual mid‑platform foundation enable online operations, risk compliance, and industry‑first DevOps certification.

DevOps · Digital Transformation · ICBC Fund
0 likes · 7 min read
Qunar Tech Salon
Sep 28, 2023 · Operations

Automated Root Cause Analysis for Flight Ticket Transaction Interception at Qunar: Design, Algorithm, and Performance Optimizations

This article describes how Qunar implemented an automated root‑cause analysis system for flight‑ticket transaction interception, detailing the problem background, system research, a custom algorithm focusing on explanatory power, performance optimizations that reduced analysis time from five minutes to under ten seconds, and the resulting operational improvements.

Operations · algorithm · root-cause analysis
0 likes · 13 min read
Efficient Ops
Sep 25, 2023 · Operations

Master Ansible: Architecture, Workflow, and the Seven Essential Commands

This article introduces Ansible as an open‑source automation tool, explains its core architecture and workflow, and provides detailed usage of its seven primary commands with examples, helping readers quickly grasp how to configure, deploy, and manage systems efficiently.

Ansible · Configuration Management · DevOps
0 likes · 8 min read
Efficient Ops
Sep 24, 2023 · Big Data

Mastering Kafka: From Basics to Advanced Operations and Performance Tuning

This article provides a comprehensive overview of Apache Kafka, covering its architecture, core concepts such as topics, partitions, and replicas, common operational commands, and practical performance‑tuning tips for high‑throughput, low‑latency streaming workloads.

Distributed Systems · Kafka · Operations
0 likes · 23 min read
MaGe Linux Operations
Sep 23, 2023 · Cloud Native

Top 8 Docker Monitoring Tools to Boost Container Visibility

This article reviews eight popular Docker monitoring solutions, detailing their key features such as performance metrics, dashboards, alerting, capacity planning, log analysis, and ease of setup, helping you choose the right tool for container observability.

Cloud Native · Docker · Operations
0 likes · 8 min read
Continuous Delivery 2.0
Sep 21, 2023 · Operations

Scaling DevOps in Large Organizations: Normalization, Standardization, and Platformization

The article argues that organizations with more than a hundred engineers must go beyond merely copying DevOps practices and adopt three progressive steps (normalization, standardization, and platformization) to achieve measurable, scalable efficiency. It concludes with a promotional notice for a Python‑based continuous deployment training course.

Operations · Platformization · normalization
0 likes · 8 min read
DataFunSummit
Sep 20, 2023 · Operations

Applying Intelligent Supply Chain in the Pharmaceutical Industry: Development, Digital Opportunities, Data Monetization, and AI‑Driven Growth

This article explains how intelligent supply‑chain concepts can be applied to the pharmaceutical sector by outlining the industry's supply‑chain structure, digital transformation opportunities and challenges, data‑capability monetization models, and the use of AI and knowledge‑graph technologies to capture growth opportunities.

Data Monetization · Operations · pharmaceutical supply chain
0 likes · 16 min read
Bilibili Tech
Sep 19, 2023 · Operations

Server System Environment Baseline Management: Declarative Configuration, Multi‑OS Adaptation, Group Management, and Gray‑Release

The document proposes a declarative, multi‑OS baseline management platform that groups servers, supports gray‑release rollouts, monitors state, and automatically restores configurations, extending open‑source tools to provide versioned, conditional, and auditable system‑environment control across a large‑scale infrastructure.

Baseline · Configuration Management · Declarative Configuration
0 likes · 13 min read
Architect
Sep 16, 2023 · Operations

Common Production Failures and Their Handling Procedures

This article outlines the most common production failures—including network, server, database, software bugs, security vulnerabilities, storage, configuration errors, and third‑party service issues—and provides detailed steps for detection, investigation, and resolution to ensure system stability and reliability.

Operations · incident management · production
0 likes · 28 min read
Huolala Tech
Sep 14, 2023 · Operations

Designing an Effective UI for Monitoring Alerts: Insights from Huolala

This article shares Huolala's experience designing a unified monitoring platform UI, covering the evolution from open‑source dashboards to a fully self‑developed solution, simplification of PromQL, computed metrics, log and trace integration, and the challenges of alert configuration and visualization.

Alerting · Operations · Prometheus
0 likes · 16 min read
vivo Internet Technology
Sep 13, 2023 · Operations

Network Quality Monitoring Center: Architecture, Design, and Implementation for Large-Scale Data Center Latency Measurement

The Network Quality Monitoring Center is a large‑scale system built from three parts: lightweight agents on every server that issue coordinated ICMP ping probes, a controller that generates and distributes topology‑aware PingLists, and a storage‑analysis module that aggregates latency and loss data for real‑time visualization, alerting, and troubleshooting. The design also addresses load balancing, ingestion concurrency, and future extensions such as UDP/TCP probes.

Distributed Systems · ICMP ping · Network Monitoring
0 likes · 12 min read
JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability
0 likes · 13 min read
JD Tech
Sep 13, 2023 · Operations

Understanding Disappearing Exception Stacks and Fast Throw Optimization in Java

During system development and operations, missing exception stack traces can hinder troubleshooting; this article explains how JIT compiler optimizations like Fast Throw cause stack traces to disappear, outlines conditions for Fast Throw, shows performance impact, and demonstrates how to locate root causes using logs and metrics.

Fast Throw · JIT · Operations
0 likes · 10 min read
Ximalaya Technology Team
Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a night‑time outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low‑level protection, causing Sentinel to switch masters. The proxy’s accept queue then filled with timed‑out sockets and blocked new connections. Scaling the proxy layer and expanding capacity restored service, and the incident prompted follow‑up work on automation, health checks, and queue‑overflow alerts.

Cache · Incident · Operations
0 likes · 7 min read
Liangxu Linux
Sep 12, 2023 · Operations

Master Linux User and Group Management: Commands, Files, and Best Practices

This guide explains how Linux stores user and group information in /etc/passwd, /etc/shadow, /etc/group, and /etc/gshadow, and provides detailed usage of commands such as useradd, usermod, userdel, groupadd, groupmod, and gpasswd for creating, modifying, locking, and deleting accounts and groups.

Linux · Operations · Sysadmin
0 likes · 16 min read
MaGe Linux Operations
Sep 8, 2023 · Cloud Native

Master Real-Time Kubernetes Log Viewing with Kubetail and Stern

Learn how to efficiently monitor multiple Kubernetes pods by installing and using two lightweight, real‑time log aggregation tools—Kubetail and Stern—including installation steps for Homebrew, Linux, and Zsh, command‑line options, color output, and practical usage examples.

Cloud Native · Kubernetes · Log Monitoring
0 likes · 12 min read
macrozheng
Sep 5, 2023 · Operations

How to Manage Linux, MySQL, Redis, and MongoDB with the Web Tool Mayfly-go

This article introduces the open‑source web platform Mayfly-go, explains its key features for Linux system, MySQL, PostgreSQL, Redis, and MongoDB management, provides step‑by‑step installation and configuration instructions, and demonstrates how to use its project, machine, database, and system administration capabilities.

Operations · Web Management · database
0 likes · 8 min read
Continuous Delivery 2.0
Sep 1, 2023 · Operations

Project Health Metrics and Practices in Google’s SRE and Development Process

The article explains how Google measures and improves software quality before release by separating development and operations responsibilities, using monorepo and trunk‑based development, daily release candidates, automated testing, performance benchmarks, and a comprehensive Project Health (pH) metric system that balances speed, reliability, and quality.

Google · Metrics · Operations
0 likes · 11 min read
Efficient Ops
Aug 30, 2023 · Operations

How New Oriental Built a Scalable DevOps Platform to Cut Costs and Boost Security

This account of New Oriental’s recent DevOps transformation details how the company tackled siloed platforms, built a unified service‑tree‑driven infrastructure, created a real‑time data processing platform, and implemented comprehensive security measures (red‑blue exercises, penetration testing, sensitive data monitoring, and CA/KMS) to boost efficiency and reduce costs.

Cost reduction · Data Platform · DevOps
0 likes · 7 min read
Baidu Geek Talk
Aug 30, 2023 · Industry Insights

Midgard: Adaptive Storage Management for Search – From Simple Tables to Intelligent Layers

This article examines how Baidu's search service evolved its storage architecture—from a basic key‑value table to a hybrid HDD/Redis cache and finally to a sharded, multi‑collection design—culminating in Midgard, an intelligent storage‑layer manager that abstracts and optimizes data access for changing business needs.

Backend · Data Management · Midgard
0 likes · 11 min read
MaGe Linux Operations
Aug 29, 2023 · Operations

How to Effectively Monitor and Recover a Kafka Cluster

This guide explains essential Kafka monitoring techniques, third‑party tools, custom scripts, key metrics, and practical strategies for high availability, fault detection, rapid recovery, and ongoing testing to keep Kafka clusters stable and performant.

Operations · distributed-systems · fault tolerance
0 likes · 7 min read