Tagged articles

Operations

3329 articles · Page 18 of 34

Sep 28, 2021 · Operations

Common DBLE Operational Commands for Monitoring, Diagnosis, and Maintenance

This article provides a comprehensive guide to DBLE's built‑in commands for viewing system information, diagnosing faults, and performing maintenance tasks such as killing connections, reloading configurations, and managing sharding nodes, helping MySQL DBAs efficiently operate distributed database clusters.

DBLE · MySQL · Operations

0 likes · 8 min read

Open Source Linux

Sep 27, 2021 · Operations

Step-by-Step Guide to Installing Zabbix 5 on CentOS 7

This article provides a comprehensive, hands‑on tutorial for installing and configuring Zabbix 5 on CentOS 7, covering system overview, key terminology, disabling SELinux and firewalls, setting up repositories, installing server, agent, frontend, MariaDB, database initialization, configuration tweaks, and final web‑UI setup.

CentOS · Installation · Operations

0 likes · 9 min read

Programmer DD

Sep 27, 2021 · Operations

How a Rural County Built China’s Dominant Copy‑Printing Empire

This article traces the emergence and evolution of Newhua County’s copy‑printing industry—from 1960s typewriter repairs to a nationwide network of repair shops, second‑hand markets, and equipment manufacturing—highlighting its social roots, ladder‑style development, research methods, key findings, and lasting impact on China’s office‑equipment sector.

China · Newhua · Operations

0 likes · 25 min read

Ops Development Stories

Sep 27, 2021 · Cloud Native

Mastering Kubernetes Liveness, Readiness, and Startup Probes: A Hands‑On Guide

This article explains how to configure Kubernetes liveness, readiness, and startup probes using exec, HTTP, and TCP checks, demonstrates practical YAML examples, shows how probes affect pod lifecycle events, and provides best‑practice recommendations to avoid common pitfalls.

Kubernetes · Operations · Readiness Probe

0 likes · 15 min read

Efficient Ops

Sep 23, 2021 · Operations

Why Did Our New Deployment Crash? Uncovering Metaspace‑Induced Full‑GC

The article recounts a staged rollout of the Maybach service on elastic cloud, details the timeline of successful and failing deployments, analyzes JVM metrics revealing excessive Metaspace usage that triggered continuous full garbage collections, and explains how this caused system‑wide timeouts and a half‑hour outage.

JVM · Metaspace · Operations

0 likes · 10 min read

Efficient Ops

Sep 23, 2021 · Operations

How Leading Chinese Insurers Achieved DevOps Maturity: Case Studies and Insights

This article examines how three major Chinese insurance firms applied the CAICT DevOps Capability Maturity Model to improve IT efficiency, integrate teams, and accelerate continuous delivery, highlighting architectural innovations, cloud adoption, and measurable performance gains across distributed core systems, e‑commerce platforms, and agricultural claims solutions.

Case Study · Insurance · Maturity Model

0 likes · 9 min read

Liangxu Linux

Sep 22, 2021 · Cloud Native

Master Dockerfile: Complete Guide to All Instructions and Best Practices

This article provides a comprehensive, step‑by‑step explanation of every Dockerfile instruction—including variables, FROM, RUN, CMD, LABEL, EXPOSE, ENV, ARG, ADD, COPY, ENTRYPOINT, VOLUME, STOPSIGNAL, HEALTHCHECK, SHELL, WORKDIR, and USER—along with syntax details, usage tips, and practical code examples for building efficient container images.

Docker · Image Build · Operations

0 likes · 12 min read

Efficient Ops

Sep 22, 2021 · Operations

Master Advanced kubectl Tricks: Debug, Filter, and Automate Kubernetes Pods

This article shares a collection of powerful kubectl commands and techniques—including API debugging, status‑based pod filtering and deletion, node‑specific pod listing, pod distribution statistics, and proxy usage—to help Kubernetes operators work more efficiently and avoid manual API scripting.

CLI · Operations · devops

0 likes · 7 min read

DevOps Cloud Academy

Sep 21, 2021 · Operations

Practical Elasticsearch Operations and Performance Tuning Guide

This article extends previous Elasticsearch cheat sheets with practical commands and step‑by‑step instructions for shard allocation, replica adjustment, cluster settings, slow‑log configuration, mapping routing, force merge, bulk writes, refresh intervals, translog durability, heap sizing, disk‑space monitoring, and troubleshooting strategies.

Elasticsearch · Operations · Performance Tuning

0 likes · 7 min read

dbaplus Community

Sep 17, 2021 · Operations

Essential Ops Lessons: Prevent Data Loss, Secure Servers, and Optimize Performance

Drawing from three and a half years of system administration, this article outlines practical guidelines for safe online operations, data protection, server security, continuous monitoring, performance tuning, and the right mindset to avoid costly mishaps in production environments.

Operations · Performance Tuning · backup

0 likes · 12 min read

Efficient Ops

Sep 16, 2021 · Operations

How Chinese Banks Are Accelerating Digital Transformation with DevOps Maturity

This article reviews the China Academy of Information and Communications Technology's DevOps Capability Maturity Model, shows how major state‑owned banks have participated in 39 assessments, and presents detailed case studies illustrating each bank's DevOps adoption, challenges, and outcomes.

Capability Maturity Model · Case Study · Operations

0 likes · 11 min read

Efficient Ops

Sep 15, 2021 · Operations

How China’s Telecom Giants Accelerate Efficiency with the DevOps Maturity Model

This article details how leading Chinese telecom operators have adopted the CAICT‑led DevOps Capability Maturity Model, evaluating 17 projects across multiple companies to improve IT efficiency, integrate resources, and support business systems, showcasing concrete performance gains and best‑practice insights.

Maturity Model · Operations · Telecom

0 likes · 15 min read

Java Architect Essentials

Sep 14, 2021 · Operations

Graceful Service Startup and Shutdown for Microservices with Spring Boot and Docker

This article explains how to implement graceful shutdown and startup for microservices using JVM shutdown hooks, Spring Boot's built‑in mechanisms, Docker stop signals, and external containers like Jetty, providing code examples and best‑practice recommendations for ensuring services deregister, reject traffic, and start only after health checks succeed.

Docker · GracefulShutdown · Operations

0 likes · 10 min read

Efficient Ops

Sep 14, 2021 · Operations

How China’s Leading Banks Achieve DevOps Maturity: Real‑World Case Studies

This article examines how major Chinese state‑owned banks applied the CAICT DevOps Capability Maturity Model to improve IT efficiency, integrate resources, and support business systems, detailing assessment numbers, project implementations, challenges, and outcomes across continuous delivery, security, and toolchain standards.

Case Study · Maturity Model · Operations

0 likes · 14 min read

Efficient Ops

Sep 13, 2021 · Operations

How Chinese Banks Accelerate Digital Transformation with the DevOps Maturity Model

This article outlines how major Chinese banks have adopted the CAICT-led DevOps Capability Maturity Model, presenting participation numbers, describing the standard’s development and global significance, and providing contact details for further engagement.

China · Maturity Model · Operations

0 likes · 5 min read

Architect's Alchemy Furnace

Sep 11, 2021 · Operations

Mastering Arthas: A Practical Guide to Java Runtime Debugging and Monitoring

This article introduces Arthas, a Java online diagnostic tool, explains its instrumentation‑based runtime principle, guides installation on various platforms, and provides a comprehensive command reference—including basic, system, class, and enhancement commands—for effective debugging, monitoring, and performance analysis of Java applications.

Arthas · Instrumentation · Java

0 likes · 10 min read

Alibaba Terminal Technology

Sep 10, 2021 · Mobile Development

How Taobao Overhauled Mobile Diagnostics to Achieve 5‑15‑60 SLA

Taobao redesigned its mobile client’s diagnostics and logging architecture—introducing scenario‑based monitoring, standardized log protocols, snapshot collection, and change‑tracking SDKs—to meet a 5‑minute response, 15‑minute identification, and 60‑minute recovery goal, dramatically improving issue detection, analysis, and resolution efficiency.

Operations · client-side · log system

0 likes · 17 min read

Aikesheng Open Source Community

Sep 9, 2021 · Databases

Analyzing a MySQL Crash Bug via Error Logs and GDB to Locate the Fixed Issue in MySQL 8.0.20

The article demonstrates how to use MySQL error logs and gdb to trace a crash‑related bug, identify the affected function in the source code, compare version histories on GitHub, and confirm that the issue was fixed in MySQL 8.0.20.

Bug Fix · MySQL · Operations

0 likes · 5 min read

Efficient Ops

Sep 9, 2021 · Operations

How a Chinese Consumer Finance Firm Boosted Efficiency with DevOps – Level‑3 Assessment

In a detailed interview, Henan Zhongyuan Consumer Finance explains how its new generation consumer loan system achieved the industry‑first Level‑3 DevOps continuous delivery assessment, highlighting the standards, tools, performance metrics, challenges overcome, and future plans that together illustrate the transformative power of standardized DevOps practices.

Case Study · Operations · Software engineering

0 likes · 12 min read

Efficient Ops

Sep 9, 2021 · Operations

How CITIC Securities Boosted Efficiency with DevOps: A Deep Dive into Their Level‑3 Assessment

CITIC Securities’ CIO Xiao Gang discusses how their outsourced service platform achieved Level‑3 DevOps continuous delivery assessment, detailing the motivations, implementation challenges, measurable improvements, and future plans, while highlighting the broader significance of the national DevOps maturity model for the financial sector.

Maturity Model · Operations · Software engineering

0 likes · 11 min read

Efficient Ops

Sep 9, 2021 · Operations

How Haitong Securities Boosted Efficiency with DevOps Standard Evaluation

The interview reveals how Haitong Securities leveraged the national DevOps standard assessment to transform its software development, achieving level‑3 continuous delivery maturity, accelerating release cycles, improving quality, and outlining future DevSecOps and industry‑specific standardization plans.

Operations · continuous delivery · devops

0 likes · 11 min read

Efficient Ops

Sep 9, 2021 · Operations

How China Construction Bank’s FinTech Arm Earned Top Marks in the National DevOps Standard

The article details how JiAnXin FinTech’s YaoGuang Agile Development Platform achieved an excellent rating in China’s first national DevOps standard evaluation, sharing interview insights on platform architecture, the importance of end‑to‑end toolchains, future DevOps trends, and the tangible benefits realized after the assessment.

FinTech · Operations · Platform

0 likes · 12 min read

Open Source Linux

Sep 4, 2021 · Operations

How to Use nologin to Block User Logins on Linux

This guide explains how the Linux nologin command can politely deny user logins, log attempts, and provides multiple methods—including command-line usage, password locking, and /etc/passwd modifications—to restrict login access for specific or all users during system maintenance.

Linux · Operations · User Login

0 likes · 3 min read

HelloTech

Sep 2, 2021 · Operations

How Production Full‑Link Load Testing Guarantees High Availability at Scale

The article explains why large‑scale services must conduct production full‑link load testing, describes its evolution from ad‑hoc trials to standardized monthly practices, and details the technical and procedural steps—including traffic modeling, JMeter usage, middleware tagging, and responsibility mapping—that ensure reliable capacity planning and risk mitigation.

High Availability · Operations · capacity planning

0 likes · 13 min read

Ops Development Stories

Aug 31, 2021 · Operations

Why Every Kubernetes Deployment Needs a Standardized Ops Playbook

This article shares practical standards for Kubernetes operations—including infrastructure, application, artifact, and CI/CD guidelines—to help teams simplify management, improve reliability, and foster continuous learning and sharing in fast‑moving cloud environments.

Best Practices · CI/CD · Kubernetes

0 likes · 14 min read

Liangxu Linux

Aug 29, 2021 · Operations

Boosting a Python Service to 50k QPS: My Step‑by‑Step Performance Tuning

Through a detailed case study, the author documents the process of optimizing a Python‑based web module—identifying bottlenecks, redesigning architecture with Redis queues, tuning MySQL, adjusting Linux TCP settings, and iteratively load‑testing until achieving 50,000 QPS with sub‑70 ms latency and zero errors.

Operations · Optimization · Performance

0 likes · 9 min read

Java Captain

Aug 28, 2021 · Databases

Understanding Linux Memory Usage and SQL Join Optimization in Technical Interviews

This article walks through common interview questions on Linux memory inspection and cache clearing, explains the fields shown by the free command, and then delves into SQL join types, performance bottlenecks, buffer settings, and practical optimization techniques for MySQL databases.

Buffers · Database Performance · Linux

0 likes · 7 min read

JD Retail Technology

Aug 24, 2021 · Operations

Key Metrics and Process for Lean Value Stream Analysis

The article explains how lean value‑stream analysis uses meaningful metrics such as lead time, process time and percent complete & accurate, outlines a step‑by‑step workflow for mapping and evaluating value streams, and demonstrates the approach with a department‑level case study and radar‑chart analysis.

Lean · Operations · Value Stream

0 likes · 6 min read

Efficient Ops

Aug 23, 2021 · Operations

Master HAProxy: Build High‑Performance L7/L4 Load Balancers & HA Clusters

This guide introduces HAProxy, an open‑source L4/L7 load balancer, and walks through its core features, performance and stability characteristics, step‑by‑step installation on CentOS 7, configuration of both L7 and L4 balancing, monitoring, and setting up high‑availability with Keepalived.

HAProxy · High Availability · Linux

0 likes · 27 min read

MaGe Linux Operations

Aug 21, 2021 · Cloud Computing

Choosing the Right Cloud Platform and Load Balancing Strategy: A Practical Guide for Ops Engineers

This article explores how ops engineers can select suitable cloud platforms, evaluate major IaaS providers, choose appropriate programming languages, and implement effective load‑balancing solutions such as LVS, Nginx, HAProxy, and Alibaba Cloud SLB to ensure stable, scalable cloud operations.

IaaS · Operations · load balancing

0 likes · 6 min read

IT Architects Alliance

Aug 21, 2021 · Operations

Mastering Nginx: From Basics to Advanced Load Balancing and Rate Limiting

This article explains what Nginx is, why it’s chosen for high‑performance reverse proxy and load balancing, walks through its event‑driven architecture, core configuration directives, virtual host setups, location regex rules, static‑dynamic separation, rate‑limiting techniques, load‑balancing algorithms, high‑availability settings and practical code examples.

Nginx · Operations · Reverse Proxy

0 likes · 19 min read

58UXD

Aug 20, 2021 · Operations

How the Ganjian Salary Wish Festival Boosted User Engagement

This article analyzes the Ganjian Salary Wish Festival as a case study of operational marketing, exploring industry insights, audience targeting, brand messaging, benefit‑driven conversion, interactive game design, and data results to reveal how such activities can sustainably retain users beyond simple incentives.

Case Study · Marketing · Operations

0 likes · 5 min read

Architects' Tech Alliance

Aug 16, 2021 · Operations

The Evolution, Types, and Pitfalls of Enterprise Mid‑Platform Architecture

This article traces the history of the Chinese "mid‑platform" concept, outlines how major tech firms implement various middle‑platform strategies, distinguishes front‑end, back‑end, and middle layers, categorizes platform types, and highlights common pitfalls and organizational challenges in building such platforms.

Business Architecture · Enterprise Architecture · Operations

0 likes · 12 min read

Code Ape Tech Column

Aug 13, 2021 · Operations

Why Kafka? Deep Dive into Architecture, Performance, and Production Deployment

This article explains the need for a messaging system, explores Kafka's core concepts, cluster architecture, performance optimizations like sequential disk writes and zero‑copy, and provides detailed guidance on sizing hardware, configuring producers and consumers, and managing a production Kafka deployment.

High Availability · Kafka · Operations

0 likes · 32 min read

Efficient Ops

Aug 11, 2021 · Operations

Scaling Kubernetes Clusters: Node Quotas, Kernel Tweaks & Etcd Tips

This guide outlines how to prepare large‑scale Kubernetes clusters on public clouds by increasing node quotas, adjusting kernel parameters, configuring high‑availability etcd with the etcd‑operator, tuning kube‑apiserver settings, and applying pod‑level best practices for resource limits and affinity.

Operations · cluster scaling · kernel tuning

0 likes · 8 min read

DevOps

Aug 11, 2021 · Operations

Introduction to Chaos Engineering – Part 2: Four Steps for Disrupting Complex Systems

This article explains that chaos engineering is not a magic cure but a disciplined practice for testing distributed systems by designing and running controlled experiments, outlining four essential steps—observability, defining steady state, hypothesizing events, and executing experiments—to gain confidence in system resilience.

Observability · Operations · chaos engineering

0 likes · 11 min read

DevOps

Aug 9, 2021 · Operations

Microsoft Digital: Internal IT Transformation and Operational Excellence

This article describes how Microsoft’s internal IT organization, renamed from CSEO to Microsoft Digital, drove a comprehensive digital transformation by migrating operations to Azure, adopting cloud‑centric architecture, implementing DevOps, enhancing security, data, and AI capabilities, and aligning vision‑driven priorities to boost productivity, customer focus, and business outcomes.

Data Analytics · Operations · digital transformation

0 likes · 20 min read

Java Architect Essentials

Aug 6, 2021 · Operations

ByteDance Data Center Scale: Server Count, Bandwidth, and CDN Architecture

The article provides an overview of ByteDance's massive data center infrastructure, detailing server quantities, total outbound bandwidth reaching several terabits, the role of dual‑link designs, and how CDN acceleration enables billions of users to access Douyin and related services smoothly.

CDN · Data Center · Operations

0 likes · 8 min read

Alibaba Cloud Native

Aug 6, 2021 · Operations

Scaling Chaos Engineering at Qunar: Lessons from Thousands of Microservices

Qunar shares how it built a large‑scale chaos engineering platform for thousands of microservices, detailing tool selection, architecture, evolution stages, fault‑injection scenarios, strong/weak dependency automation, open‑source contributions, and future plans for automated random drills.

Fault Injection · Operations · Reliability

0 likes · 9 min read

Wukong Talks Architecture

Aug 6, 2021 · Databases

Redis Operational Best Practices and Guidelines

This guide presents a comprehensive set of mandatory, reference, and recommended Redis usage standards—including command restrictions, key naming, data sizing, persistence configurations, monitoring, and deployment strategies—to improve performance, reliability, and operational efficiency for production environments.

Best Practices · Operations · Performance

0 likes · 9 min read

Efficient Ops

Aug 2, 2021 · Operations

How Alibaba Scales Massive Big Data Engines with an SRE Framework

This article describes Alibaba’s comprehensive SRE system for managing ultra‑large‑scale big data engines, detailing stability metrics, resource cost management, and intelligent operation productization, and introduces speaker Fu Tianyuan, a senior operations expert leading the MaxCompute and DataWorks SRE team.

Alibaba · Big Data · Cloud Computing

0 likes · 3 min read

ByteDance SE Lab

Jul 30, 2021 · Operations

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

The article examines Salesforce’s five‑hour global outage caused by a shortcut DNS deployment and the subsequent recovery challenges, then explores a viral experiment where twenty smartphones generated artificial traffic congestion, illustrating how real‑time data feeds and operational safeguards can prevent large‑scale service disruptions.

Big Data · Cloud Computing · Incident Management

0 likes · 7 min read

Java High-Performance Architecture

Jul 30, 2021 · Operations

Essential Linux and Java Debugging Tools Every Engineer Should Know

This guide compiles a comprehensive set of Linux commands, Java troubleshooting utilities, JVM options, and IDE plugins that help developers diagnose performance issues, resolve jar conflicts, and monitor production systems efficiently.

Java · Linux · Operations

0 likes · 17 min read

DevOps Cloud Academy

Jul 29, 2021 · Operations

Ensuring the CI/CD Pipeline Is the Sole Path to Production

The article emphasizes that a CI/CD pipeline must be the exclusive route for deploying immutable artifacts to production, warning against direct local deployments, highlighting risks of lost traceability, and urging strict network-level controls to ensure only the pipeline can release code.

CI/CD · Operations · Production Deployment

0 likes · 4 min read

ITFLY8 Architecture Home

Jul 29, 2021 · Mobile Development

How Mobile API Gateways Transform App Development and Scale High‑Traffic Services

Mobile API gateways act as protocol adapters between networks, centralizing services for mobile apps; the article explains their role at Alibaba, the evolution of R&D efficiency through unified programming models and SDKs, large‑scale platform development, high‑availability strategies, and the EMAS top‑level model for mobile development.

EMAS · Mobile Development · Operations

0 likes · 9 min read

Full-Stack Internet Architecture

Jul 28, 2021 · Operations

Common Open‑Source Tools for MySQL Operations and Maintenance

This article introduces a curated list of open‑source MySQL operational tools—including online DDL changers, backup and restore utilities, load‑testing frameworks, flashback solutions, slow‑query analyzers, replication consistency checkers, audit platforms, and graphical clients—explaining their principles, usage scenarios, and visual references.

MySQL · Operations · Performance

0 likes · 8 min read

DevOps

Jul 28, 2021 · Operations

Improving System Availability: Stages, Influencing Factors, and Practical Measures

This article explains system availability, outlines three stages of incident handling, identifies key factors that degrade availability such as human error, avalanche effects, untested releases and infrastructure failures, and proposes technical and team‑oriented practices to enhance reliability and achieve higher "nines" of uptime.

Incident Management · Operations · Reliability

0 likes · 11 min read

Open Source Linux

Jul 27, 2021 · Cloud Native

Why Coinbase Skips Kubernetes: Insights from Their Container Orchestration Journey

This article examines Coinbase's decision to avoid Kubernetes, tracing container technology history, outlining the operational and security challenges of orchestration platforms, and detailing the company's custom Odin+ASG solution and future considerations for container management.

Container Orchestration · Kubernetes · Operations

0 likes · 20 min read

Open Source Linux

Jul 27, 2021 · Operations

How to Effectively Locate and Debug Production Issues Using Logs and Remote Debugging

This guide walks beginners through understanding logs, using them for error tracing, applying monitoring and alerts, and performing remote debugging to quickly pinpoint and resolve production problems, emphasizing practical steps and best practices for reliable system maintenance.

Operations · Remote Debugging · Troubleshooting

0 likes · 7 min read

Efficient Ops

Jul 27, 2021 · Operations

What Does China’s 2021 DevOps Survey Reveal About Industry Trends?

On July 15, 2021, the China Academy of Information and Communications Technology unveiled the 2021 China DevOps Status Survey Report, detailing the nation’s digital transformation, the growing demand for rapid software delivery, the extensive multi‑company survey methodology, and key findings on DevOps adoption and future trends.

2021 · China · Operations

0 likes · 5 min read

Open Source Linux

Jul 26, 2021 · Operations

How Floods Tested Zhengzhou’s Telecom Backbone and What It Reveals About Network Resilience

Severe flooding in Zhengzhou crippled core telecom facilities, prompting emergency repairs, backup HLR deployment, and temporary authentication shutdowns, while highlighting the critical role of network resilience and disaster‑recovery strategies for maintaining communication services during natural disasters.

Disaster Recovery · HLR · Network Resilience

0 likes · 7 min read

Java Architect Essentials

Jul 25, 2021 · Backend Development

How I Cut Full GC Frequency by 80%: A JVM Tuning Case Study

Over a month of systematic JVM tuning reduced Full GC from 40 times per day to once every ten days and halved Young GC duration by adjusting heap sizes, survivor ratios, and metaspace settings while investigating and fixing a memory leak caused by an anonymous inner class listener.

Garbage Collection · JVM · Operations

0 likes · 10 min read

Architects' Tech Alliance

Jul 24, 2021 · Backend Development

How to Build a Scalable Backend Stack for Startups: Languages, Components, and Best Practices

This guide outlines a comprehensive backend technology stack for startups, covering language choices, core components, development processes, infrastructure services, database options, monitoring, CI/CD, and operational best practices to help teams design, select, and implement a reliable server-side architecture.

Cloud · Operations · backend

0 likes · 31 min read

Efficient Ops

Jul 20, 2021 · Databases

Master Redis: 13 Proven Practices to Boost Memory, Performance & Reliability

Discover a comprehensive Redis best‑practice guide covering memory optimization, performance tuning, high reliability, daily operations, resource planning, monitoring, and security, with actionable tips such as key length control, maxmemory settings, lazy‑free, connection pooling, replication strategies, and safe deployment practices.

Database Management · Operations · Redis

0 likes · 23 min read

Ops Development Stories

Jul 20, 2021 · Cloud Native

How to Build a Production‑Ready ELK Logging Stack on Kubernetes

This guide walks through the concepts of ELK, why log management is essential for Kubernetes, three collection strategies, required log fields, and step‑by‑step deployment of Elasticsearch, Kibana, Filebeat, and Logstash—including YAML manifests, configuration snippets, and Kibana UI setup—for a fully operational, cloud‑native logging solution.

ELK · Kibana · Kubernetes

0 likes · 26 min read

Youzan Coder

Jul 19, 2021 · Operations

How We Built a Robust Search Middle Platform: From Pain Points to Full‑Scale Quality Assurance

This article examines the challenges faced by a search middle platform—such as inaccurate impact assessment, unstable underlying clusters, and missing process standards—and details a comprehensive quality‑assurance strategy that includes baseline test suites, stability practices, performance testing, emergency drills, and systematic monitoring to ensure reliable search services.

Operations · Search · backend

0 likes · 13 min read

Liangxu Linux

Jul 18, 2021 · Operations

Why Switch to Dust? A Faster, Colorful Alternative to du for Disk Usage

Dust is a Rust‑written, visually rich replacement for the traditional du command that shows directory sizes as a colored tree, making it easier to spot large folders and understand disk consumption at a glance.

Operations · disk usage · du alternative

0 likes · 5 min read

Efficient Ops

Jul 18, 2021 · Operations

Master Ansible in 16 Visual Steps

Ansible, an increasingly popular open‑source automation tool built on Python, simplifies batch system configuration, program deployment, and command execution with thousands of built‑in modules, offering a beginner‑friendly yet powerful solution for modern operations teams.

Ansible · Operations · Python

0 likes · 3 min read

macrozheng

Jul 18, 2021 · Operations

Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures

In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.

Bilibili · CDN · High Availability

0 likes · 10 min read

IT Architects Alliance

Jul 18, 2021 · Operations

How to Achieve Smooth Releases and AB Testing with Nginx: A Step‑by‑Step Guide

This article details a practical smooth‑release process for a cloud‑office system, explains how to use Nginx health‑check endpoints to take instances offline, and presents three AB‑testing routing strategies—IP‑based, cookie‑based, and AB‑cluster proxy—complete with configuration examples, pros and cons, and deployment steps.

AB testing · Blue-Green Deployment · Deployment

0 likes · 9 min read

21CTO

Jul 16, 2021 · Operations

What Bilibili’s Outage Teaches About Achieving True High Availability

The article analyzes Bilibili’s recent service outage, explains why high availability matters, introduces key metrics like MTBF and MTTR, and outlines practical strategies such as redundancy, rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to build resilient systems.

High Availability · MTBF · MTTR

0 likes · 18 min read

High Availability Architecture

Jul 15, 2021 · Operations

Baidu Game Microservice Monitoring Practice and System Design

This article describes Baidu's comprehensive approach to monitoring game microservices, covering the background, initial monitoring tools, evolution of the monitoring system, systematic design for risk control, intelligent detection, alarm optimization, efficient fault localization, and future outlook for high‑availability architecture.

Baidu · Game Development · Observability

0 likes · 13 min read

Code Ape Tech Column

Jul 15, 2021 · Operations

What Really Caused Bilibili’s Sudden Outage? A Deep Dive into the Technical Failure

The article analyzes Bilibili's recent half‑hour service disruption, explores technical rumors such as an etcd crash, examines Kubernetes‑based cloud‑native infrastructure, reviews similar historic outages, and offers expert recommendations for improving high‑availability and disaster‑recovery in large‑scale internet services.

Bilibili · Etcd · Kubernetes

0 likes · 8 min read

dbaplus Community

Jul 14, 2021 · Operations

How to Rapidly Diagnose and Resolve Common Online Service Failures

This guide walks through practical troubleshooting steps for typical production incidents—including disk exhaustion, high CPU, Java OOM, MySQL deadlocks and slow queries, Redis memory alerts, network TCP issues, and business‑log analysis—providing concrete commands, diagrams and mitigation strategies for each layer.

Network · Operations

0 likes · 32 min read

Baidu Geek Talk

Jul 14, 2021 · Operations

How Baidu Built a Robust Microservice Monitoring System for Game Services

This article details Baidu's comprehensive microservice monitoring practice for its game platform, covering the initial fragmented setup, systematic redesign across risk control, intelligent monitoring, smart alerting, and rapid fault localization, and presents the resulting monitoring architecture, visualizations, and future improvement goals.

Alerting · Baidu · Operations

0 likes · 14 min read

Open Source Linux

Jul 11, 2021 · Operations

Mastering Shell Script Best Practices for Reliable Automation

This article outlines practical shell‑script guidelines for automating system and application operations, covering script header conventions, formatting, error handling, safe use of commands, variable handling, file packaging, pipeline restrictions, concurrency locks, logging, and risk‑mitigation strategies to make automation both efficient and secure.

Best Practices · Linux · Operations

0 likes · 10 min read

Efficient Ops

Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

Incident Management · Operations · escalation

0 likes · 10 min read

Java High-Performance Architecture

Jul 10, 2021 · Operations

Why Non‑Invasive Production Debugging Is the Missing 2021 DevOps Trend

The article argues that while DevOps has exploded with trends such as Hybrid Deployments, DataOps, and GitOps, an overlooked but critical shift toward non‑invasive production debugging, which provides code‑level observability without disrupting services, will become essential for modern DevOps teams.

APM · Instrumentation · Operations

0 likes · 14 min read

Yuewen Technology

Jul 9, 2021 · Operations

Mastering Efficient Log Utilization: Best Practices for Logging and Collection

This article outlines how to design, print, collect, and manage online service logs efficiently—covering log levels, key information, formatting, rolling, local vs. remote storage, real‑time collection, and tool selection—to turn logs into a valuable debugging and analytics asset.

Elastic Stack · Logging · Operations

0 likes · 16 min read

Selected Java Interview Questions

Jul 7, 2021 · Operations

Redis Monitoring Metrics and Commands Guide

This article provides a comprehensive overview of Redis monitoring metrics—including performance, memory, basic activity, persistence, and error indicators—along with recommended monitoring tools, configuration settings, and command-line examples for gathering and interpreting these metrics in production environments.

Operations · Performance · Redis

0 likes · 7 min read

Alibaba Cloud Developer

Jul 6, 2021 · Operations

Mastering Release Strategies: Alibaba’s DevOps Playbook for Faster, Safer Deployments

This article surveys common software release strategies—stop‑the‑world, canary, gray/rolling, blue‑green, A/B testing, and traffic‑isolation—detailing their advantages, disadvantages, and ideal scenarios, and then presents Alibaba’s practical best‑practice guide for planning, monitoring, and continuously delivering high‑quality releases.

Blue-Green Deployment · Canary Release · Continuous Deployment

0 likes · 16 min read

Efficient Ops

Jul 5, 2021 · Operations

10 Essential Practices to Prevent DBA and Ops Disasters

Learn ten practical strategies—from safe change rollbacks and cautious destructive commands to robust backups, clear prompts, vigilant monitoring, and disciplined handovers—that help DBAs and operations engineers avoid costly system failures and maintain reliable production environments.

Operations · Oracle · backup

0 likes · 6 min read

Top Architect

Jul 4, 2021 · Operations

Design and Implementation of a Simple Gray Release System

The article explains the concept of gray release, outlines a basic architecture with strategy configuration, execution, and service registry components, describes common traffic-splitting strategies, and details practical implementations using Nginx, gateway services, and complex scenarios involving data synchronization and message queues.

A/B testing · Deployment · Operations

0 likes · 7 min read

IT Architects Alliance

Jul 3, 2021 · Operations

JD.com Order Transfer, Inventory Management, and Fulfillment Workflow Overview

This article explains JD.com's order transfer process, inventory hierarchy, support relationships, order transfer and planning systems, fulfillment workflow, and risk control mechanisms, illustrating how millions of orders are allocated, scheduled, and executed across multiple warehouses and channels.

Inventory · Operations · Order Management

0 likes · 11 min read

IT Architects Alliance

Jul 3, 2021 · Operations

Understanding JD.com Order Fulfillment, Order Splitting, and Amount Allocation Systems

This article explains JD.com's end‑to‑end order fulfillment process, the concepts of 211/411 delivery promises, the mechanisms behind order splitting by warehouse, merchant and payment dimensions, and how monetary discounts are proportionally allocated across SKUs.

JD.com · Operations · e-commerce

0 likes · 9 min read

Alibaba Cloud Native

Jun 30, 2021 · Operations

How We Built a Dual‑Center, High‑Availability RocketMQ Platform

This article explains why RocketMQ was chosen, describes its large‑scale usage, details the design and implementation of a same‑city dual‑center architecture with near‑by production and consumption, outlines failover mechanisms, governance practices, lessons learned, and future plans for the messaging platform.

Dual Center · Governance · High Availability

0 likes · 15 min read

Architects Research Society

Jun 29, 2021 · Operations

Understanding the Differences Between SCADA and DCS Systems

SCADA and DCS originated as separate control systems but have converged over time; SCADA focuses on distributed monitoring and data acquisition across wide geographic areas, while DCS emphasizes centralized control, and modern high‑bandwidth networks now allow them to operate together as a unified monitoring solution.

DCS · Operations · SCADA

0 likes · 6 min read

DevOps

Jun 29, 2021 · Operations

Why Traditional Enterprise IT Departments Are Marginalized and How Digital Transformation Can Create a New IT

The article analyzes the current marginalization of IT departments in traditional enterprises due to limited value, hierarchical organization, and misaligned assessment, and proposes that digital transformation—redefining IT roles, aligning technology with business goals, and building a digital foundation—can turn IT into a profit‑center and strategic enabler.

IT transformation · Operations · business alignment

0 likes · 12 min read

Tencent Cloud Developer

Jun 28, 2021 · Cloud Native

Effective Service Governance for Serverless: Challenges and Solutions

Effective serverless governance requires comprehensive observability, traffic management, and service registration built on Kubernetes, using either a mesh sidecar with Istio or an embedded SDK, to simplify complex operational tasks such as discovery, fault tolerance, gray releases, and metric correlation for large‑scale function deployments.

Observability · Operations · Serverless

0 likes · 17 min read

DevOps

Jun 28, 2021 · Databases

When Deleting Databases Goes Wrong: Cases, Legal Risks, and DevOps Lessons

This article examines real-world database deletion incidents, the associated legal consequences, and how DevOps culture and operational best‑practices can turn such mistakes into learning opportunities rather than career‑ending failures.

Operations · database deletion · devops

0 likes · 13 min read

Programmer DD

Jun 27, 2021 · Operations

How ByteDance Powers Billions with Multi‑Terabit Data Center Bandwidth

The article examines how ByteDance, the company behind Douyin and TikTok, and other Chinese tech giants operate massive data centers with terabit‑level outbound bandwidth, millions of servers, and extensive CDN and load‑balancing architectures to support hundreds of millions of concurrent users.

ByteDance · CDN · Data Center

0 likes · 9 min read

Ops Development Stories

Jun 25, 2021 · Operations

How to Build Custom Zabbix Webhook Alerts with JavaScript (DingTalk Example)

This guide explains how Zabbix 4.4+ lets you use custom JavaScript in webhook media types to send alert notifications, details the built‑in Zabbix objects, shows configuration steps, data validation, logging rules, and provides a complete DingTalk webhook script with testing instructions.

Alert · DingTalk · JavaScript

0 likes · 11 min read

Java Architect Essentials

Jun 24, 2021 · Operations

Scaling WeChat Moments: Architecture, Capacity Planning, and Flexible Strategies for High Traffic

This article analyzes the large‑scale architecture of WeChat Moments, detailing image and video traffic characteristics, hardware and software safeguards, disaster‑recovery mechanisms, capacity assessment, and a series of flexible strategies such as compression format changes, bitrate reduction, buffer pools, and timeline throttling to handle holiday spikes.

Flexible Strategies · Moments · Operations

0 likes · 10 min read

Efficient Ops

Jun 23, 2021 · Operations

Agent vs Network Data: Choosing the Right Cloud Performance Monitoring Approach

This article compares agent‑based and network‑data approaches to cloud‑native application performance monitoring, discussing their architectures, advantages, challenges, and how combining white‑box and black‑box techniques can improve fault detection, scalability, and operational efficiency in complex cloud environments.

Agent · Cloud Monitoring · Operations

0 likes · 10 min read

DevOps

Jun 22, 2021 · Operations

Building Digital Champion Capabilities: Integrating Customer Solutions, Operations, Technology, and Talent Ecosystems

The article outlines how digital‑champion enterprises achieve superior performance by integrating four core ecosystems—customer solutions, operations, technology, and talent—through strategic planning, partnership, and advanced technologies such as AI, big data, and industrial IoT, while highlighting maturity stages and practical implementation steps.

Artificial Intelligence · Big Data · Operations

0 likes · 28 min read

ByteDance Terminal Technology

Jun 21, 2021 · Information Security

CI/CD Business Security Compliance Detection: Challenges, Improvements, and Benefits

This article outlines the background, current challenges, and recent enhancements of CI/CD‑integrated business security compliance detection for mobile apps, including incremental source‑code scanning, call‑graph analysis, and performance gains, while also discussing future directions and benefits.

Android · CI/CD · Operations

0 likes · 13 min read

Practical DevOps Architecture

Jun 18, 2021 · Operations

Step-by-Step Guide to Installing Docker, MySQL, and Apollo Services on CentOS

This tutorial provides detailed commands and configurations for installing a specific Docker version, setting up persistent MySQL containers, importing Apollo database scripts, and deploying Apollo Config, Admin, and Portal services using Docker on a CentOS host.

Apollo · Docker · Operations

0 likes · 5 min read

dbaplus Community

Jun 17, 2021 · Cloud Native

How Dada Achieved Seamless Elastic Scaling for Massive Delivery Peaks

Facing surges during holidays and major shopping events, Dada’s DevOps team built a cloud‑native elastic scaling system that combines fine‑grained capacity management, multi‑cloud support, metric‑driven auto‑scaling, and extreme‑scale down strategies, delivering stable delivery performance while cutting costs.

Auto Scaling · Elastic Scaling · Multi-Cloud

0 likes · 17 min read

HomeTech

Jun 16, 2021 · R&D Management

Technical Debt Governance in Autohome's Cloud Platform: Theory and Practice

This article presents Autohome's Cloud Platform (Home Cloud) technical debt governance framework, defining ideal technical states, outlining five systematic steps—from factor collection to project execution—and sharing practical outcomes that have enhanced the competitiveness of its applications and development teams.

Operations · R&D Management · Software engineering

0 likes · 7 min read

DevOps

Jun 16, 2021 · Operations

Understanding Digital Transformation: Definitions, Strategic Questions, Drivers, Frameworks, Roadmaps, Benefits and Pitfalls

The article provides a comprehensive overview of digital transformation, covering its definition, essential strategic questions, key drivers such as customer expectations, cloud and AI, priority areas in the value chain, practical frameworks, roadmap steps, expected benefits and common reasons for failure.

Artificial Intelligence · Big Data · Operations

0 likes · 20 min read

Efficient Ops

Jun 15, 2021 · Operations

Mastering IT Monitoring: Strategies, Challenges, and Best Practices

This article explores the fundamentals of IT monitoring, examines common challenges such as scalability, reliability, and alert fatigue, compares four implementation approaches—from open‑source to fully custom solutions—and presents practical techniques like alert convergence, suppression, and automation to build a robust, adaptable monitoring platform.

Alert Management · Operations · System Design

0 likes · 19 min read

Java Backend Technology

Jun 14, 2021 · Operations

How ByteDance Powers Billions of Users with Multi‑Terabit Data Center Bandwidth

The article examines ByteDance's massive data‑center infrastructure, detailing server counts, multi‑terabit outbound bandwidth, dual‑link designs, CDN acceleration, and comparisons with other Chinese tech giants, illustrating how such scale enables seamless video streaming for hundreds of millions of daily users.

ByteDance · CDN · Data Center

0 likes · 9 min read

Code Ape Tech Column

Jun 9, 2021 · Operations

Understanding Disaster Recovery vs. Backup: Key Differences and Best Practices

This article explains what disaster recovery is, distinguishes it from backup, outlines classification types, compares their core differences, and details four DR maturity levels with their advantages and drawbacks to help organizations build resilient data protection strategies.

Data Protection · Disaster Recovery · High Availability

0 likes · 10 min read

Efficient Ops

Jun 8, 2021 · Operations

How Red‑Blue Drills Boost Securities Ops: From Capacity Testing to Full‑Scale Automation

Lin Ying, a senior test manager at Guoxin Securities, shares insights from his GOPS 2021 talk on the securities industry's digital transformation, current IT challenges, and a comprehensive red‑blue exercise strategy that combines full‑link load testing, automated workflows, and proactive monitoring to ensure system stability during market peaks.

Operations · capacity testing · devops

0 likes · 13 min read

IT Architects Alliance

Jun 5, 2021 · Operations

How to Tame Massive Product‑Sync Traffic in a Multi‑Store Chain System

This article analyzes the stability challenges of a multi‑store chain’s product‑copy mechanism, outlines design goals for isolation and scalability, and presents short‑ and long‑term monitoring, flow‑control, and emergency‑response strategies to ensure reliable large‑scale operations.

Flow Control · Operations · System Design

0 likes · 12 min read

Ops Development Stories

Jun 4, 2021 · Operations

Step-by-Step Guide to Monitoring MongoDB with Zabbix Agent 2

This tutorial explains how to use Zabbix Agent 2 to monitor MongoDB databases and clusters, covering the required read‑only user setup, relevant Zabbix templates, key metrics such as jumbo chunks, connection pool stats, server status, collection and replSet information, and practical configuration examples.

Agent2 · MongoDB · Operations

0 likes · 6 min read

DevOps

Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection · Operations · Reliability

0 likes · 11 min read

Efficient Ops

Jun 1, 2021 · Operations

Mastering System Stability: Building a Chaos‑Driven Platform for Financial Ops

This article details how a major securities firm analyzed business stability, built a comprehensive stability engineering platform using chaos engineering, practiced extensive fault‑injection drills, and outlines future directions such as random‑scenario exercises, red‑blue battles, and AI‑driven risk detection.

Operations · Platform · chaos engineering

0 likes · 11 min read

Efficient Ops

Jun 1, 2021 · Artificial Intelligence

How Time‑Series Analysis Powers AIOps: Overcoming Real‑World Challenges

At the 16th GOPS Global Operations Conference, Shen Hui of DingMao Technology explained how time‑series data analysis underpins AIOps, outlining its four‑step workflow, key challenges, and the company’s three‑pipeline solution that enables trend forecasting, fault prediction, and a robust AI‑driven operational platform.

AI · AIOps · Operations

0 likes · 7 min read
