Tagged articles
3281 articles
NetEase Yanxuan Technology Product Team
Mar 29, 2021 · Operations

How Yanxuan Designs a Multi‑Layer Inventory System for Seamless E‑Commerce

This article examines Yanxuan's inventory management challenges and presents a layered design—warehouse, physical, and sales layers—combined with a flexible lock‑stock pool to support multi‑channel, multi‑warehouse operations while addressing promotional and special‑use inventory requirements.

Layered Design · Operations · Supply Chain
0 likes · 14 min read
Open Source Linux
Mar 27, 2021 · Operations

Master Huawei Switch Configuration: From VLAN Setup to Link Aggregation

This guide walks you through essential Huawei switch commands, covering user and view modes, VLAN creation, port link types, batch operations, initial login procedures, service activation, link aggregation, DHCP setup, and flow‑control configuration, all illustrated with step‑by‑step screenshots.

Huawei · Operations · VLAN
0 likes · 9 min read
MaGe Linux Operations
Mar 24, 2021 · Operations

Master Linux Server Health: Essential Monitoring Commands Explained

Learn how to monitor Linux server performance using essential tools—top, vmstat, pidstat, iostat, netstat, sar, and tcpdump—understanding CPU, memory, disk I/O, and network metrics, interpreting their outputs, and applying insights to diagnose and troubleshoot system issues effectively.

CLI tools · Linux · Operations
0 likes · 19 min read
Zhuanzhuan QA
Mar 24, 2021 · Operations

Online Issue Analysis Process and Practices at Zhuanzhuan

This article outlines Zhuanzhuan's systematic approach to collecting, classifying, and resolving online user issues through weekly QA‑led meetings, detailed problem categorization, and targeted solutions, ultimately improving product design, reducing repetitive work, and enhancing overall operational efficiency.

Operations · QA process · online issue analysis
0 likes · 8 min read
dbaplus Community
Mar 23, 2021 · Operations

Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered a 10‑minute message‑send timeout after a broker’s memory failure caused a server reboot, and the post analyzes the routing registration, name‑server removal logic, client‑side timeout handling, root cause of name‑server “dead‑lock”, and proposes deployment and code‑level fixes to prevent recurrence.

Messaging · Operations · RocketMQ
0 likes · 9 min read
58UXD
Mar 23, 2021 · Product Management

How 58.com’s Spring Recruitment Campaign Boosted User Activation Through Cohesive Design

The article details how 58.com’s annual Spring recruitment campaign was strategically designed with a two‑phase timeline, unified interaction framework, and immersive storytelling to drive new user acquisition, sustained activation, and significant traffic growth during the post‑holiday hiring surge.

Operations · UI/UX · case-study
0 likes · 8 min read
Efficient Ops
Mar 22, 2021 · Operations

Boosting Operational Efficiency: Process, Tools, and Engineering Insights

This article explores practical ways to improve operational efficiency by examining process optimization, tool adoption, quality considerations, and engineering practices, highlighting real-world examples like OA, CI/CD, Spring Cloud, Java, and Kubernetes while emphasizing shared value and cultural factors.

Engineering management · Operations · efficiency
0 likes · 7 min read
Baidu Geek Talk
Mar 22, 2021 · Operations

How Baidu Achieved 99.999% Uptime for Its Massive Feed Recommendation System

This article details Baidu's Feed recommendation system architecture, explaining how a combination of dynamic retry scheduling, real‑time stop‑loss mechanisms, multi‑recall frameworks, ranking layer fallbacks, and IDC‑level multi‑master designs collectively ensure five‑nine availability across billions of daily requests.

Distributed Systems · Microservices · Operations
0 likes · 18 min read
Taobao Frontend Technology
Mar 18, 2021 · Operations

How to Build a Robust Log Analysis System for Stable Microservices

Amid microservice and distributed architectures, this article explains how to design a comprehensive log analysis system—covering collection, storage, consumption, key data points, collection methods, and practical use cases like automated test generation, issue localization, and real‑time exception monitoring—to ensure system stability.

Microservices · Operations · logging
0 likes · 13 min read
21CTO
Mar 16, 2021 · Operations

How Cloud Computing Is Redefining Operations: Trends, Challenges, and Strategies

The article examines how the rapid adoption of cloud computing, DevOps, AIOps, and FinOps is reshaping the role of IT operations, highlighting new trends, evolving work boundaries, and the essential characteristics of a modern, automated, secure, and cost‑optimized operations system.

Cost Optimization · DevOps · FinOps
0 likes · 18 min read
DevOps Cloud Academy
Mar 16, 2021 · Information Security

Best Practices for Implementing DevSecOps: Security Model, Governance, Automation, and Training

The article outlines six key DevSecOps best practices—including establishing a security model, enforcing governance policies, automating security tasks, training developers, applying network segmentation, and limiting administrative privileges—to help organizations overcome staffing and collaboration challenges while maintaining consistent security throughout the development and operations lifecycle.

DevSecOps · Operations · automation
0 likes · 4 min read
High Availability Architecture
Mar 15, 2021 · Operations

OCTO 2.0: Architecture and Implementation of Meituan’s Next‑Generation Service Governance System

This article introduces OCTO 2.0, Meituan’s next‑generation distributed service‑governance platform, detailing its overall architecture, mesh‑related features such as traffic hijacking, service subscription, lossless hot‑restart, data‑plane operations, and future cloud‑native evolution plans.

Distributed Systems · Hot Restart · Operations
0 likes · 13 min read
Big Data Technology & Architecture
Mar 13, 2021 · Operations

Comprehensive Guide to Monitoring: Objectives, Methods, Tools, and Best Practices

This article provides an in‑depth overview of monitoring, covering its purpose, key objectives, practical methods, core processes, a detailed comparison of popular monitoring tools such as Zabbix and Prometheus, and best‑practice recommendations for building scalable, reliable, and intelligent monitoring platforms.

Infrastructure · Operations · Prometheus
0 likes · 42 min read
Efficient Ops
Mar 12, 2021 · Operations

Master Linux Network Commands: From netstat to ss and tcpdump

This guide offers a practical overview of essential Linux networking tools—including netstat, ss, sar, iftop, and tcpdump—explaining how to monitor connections, analyze traffic, capture packets, and tune kernel parameters to handle massive connection loads efficiently.

Operations · netstat · network
0 likes · 10 min read
Liangxu Linux
Mar 8, 2021 · Operations

How to Locate and Release Disk Space from Deleted Open Files on Linux

When disk usage reported by df doesn't match du and inodes are not full, the discrepancy often stems from files that were deleted while still held open by running processes, and this guide explains how to identify those processes and safely free the space.

Deleted Files · Linux · Operations
0 likes · 4 min read
Efficient Ops
Mar 8, 2021 · Operations

Master Linux Automation: Startup Scripts, at, and crontab Explained

This guide walks you through essential Linux automation techniques—including boot‑time service startup with chkconfig and rc.d, one‑off scheduling using at, and recurring jobs with crontab and shell scripts—so you can manage web servers efficiently without manual intervention.

Operations · at · automation
0 likes · 9 min read
Youku Technology
Mar 5, 2021 · Industry Insights

How Youku Built a Service‑Side Quality Assurance System to Boost Release Quality

This article outlines Youku's end‑to‑end service‑side quality assurance framework, detailing the factors that affect quality across the development lifecycle, the automated testing practices integrated into the release pipeline, the platform capabilities built for data collection and replay, and the metrics used to measure improvements in reliability and development efficiency.

Backend testing · Operations · automation
0 likes · 12 min read
Continuous Delivery 2.0
Mar 4, 2021 · Operations

Transformational Leadership in DevOps: Findings from the Accelerate Study

The article explains how transformational leadership—characterized by vision, inspirational communication, intellectual stimulation, supportive behavior, and individualized recognition—drives high‑performing DevOps teams, linking leadership traits to software delivery outcomes and offering practical guidance for managers to foster a strong engineering culture.

Operations · Transformation
0 likes · 9 min read
Efficient Ops
Feb 25, 2021 · Operations

How Autonomous Networks Are Driving Telecom Industry Transformation

In a keynote at MWC 2021 Shanghai, Huawei’s Dr. Che Haiping explained how the Autonomous Networks initiative, likened to Industry 4.0, is reshaping telecom production, enabling flexible, intent‑driven services and accelerating digital transformation across the sector.

Digital Infrastructure · Huawei · Operations
0 likes · 7 min read
Suning Technology
Feb 25, 2021 · Operations

How to Optimize O2O Delivery Fulfillment for Maximum Efficiency?

This article analyzes the rapid growth of O2O home‑delivery, examines the challenges of delivery fulfillment, compares third‑party and self‑built rider models, and presents hybrid, batch‑ordering, and AI‑driven optimization strategies to reduce costs and boost efficiency.

Logistics · O2O · Operations
0 likes · 10 min read
Taobao Frontend Technology
Feb 25, 2021 · Frontend Development

How a Micro‑Frontend Workbench Boosts E‑Commerce Operations Efficiency

An Operations Workbench built on a micro‑frontend architecture unifies fragmented e‑commerce tools into standardized Process Units and SOPs, boosting operational efficiency, reducing technical duplication, and providing data‑driven, consistent experiences for thousands of operators across hundreds of business scenarios.

Operations · SOP · architecture
0 likes · 23 min read
58UXD
Feb 25, 2021 · Product Management

How Game Mechanics Boost Engagement in Live Dating Streams

This article examines the design of the "Spring Heat Value Competition" activity for a local live‑dating platform, detailing how game‑like mechanics, emotional curves, and visual styling were used to strengthen host‑user relationships, increase user participation, and drive a 116% rise in gift revenue.

Operations · Product Design · dating
0 likes · 6 min read
FunTester
Feb 24, 2021 · Operations

How to Build a Real‑World API Load Test for a Knowledge‑Base Service

This article walks through the design, scenario planning, and Java implementation of a fixed‑thread load test that simulates teacher login, knowledge‑point queries, course recommendations, and collect/uncollect actions, then presents the resulting performance metrics.

API testing · Backend · Load Testing
0 likes · 8 min read
New Oriental Technology
Feb 22, 2021 · Operations

Full-Chain Load Testing Theory, Model Design (DESP) and New Oriental Continuation Class Case Study

This article introduces the fundamentals of full‑chain load testing, explains why it is essential for large‑scale distributed systems, outlines the DESP model with its four simulation dimensions, and presents a detailed case study of New Oriental's continuation‑class platform including architecture, data preparation, load design, automation and recruitment information.

Load Testing · Operations · Performance Testing
0 likes · 14 min read
Code Ape Tech Column
Feb 20, 2021 · Operations

Bug Tracking Workflow and Tool Comparison

This article defines bug tracking, outlines essential workflow steps and report contents, and evaluates a range of bug tracking tools—including BugHerd, Bugzilla, MantisBT, DebugMe, Donedone, Marker.io, Jira, Bughost, Zoho, Backlog, and Redmine—highlighting their features, integrations, pricing, advantages, and drawbacks to help teams choose the right solution.

Operations · Software tools · bug tracking
0 likes · 16 min read
php Courses
Feb 18, 2021 · Operations

How to Create and Manage Swap Partitions on Linux

This article explains the purpose of swap partitions, shows how to check current swap usage, and provides step‑by‑step instructions for creating swap space both via a dedicated disk partition and by using a swap file on a Linux server.

Linux · Operations · Partition
0 likes · 3 min read
Programmer DD
Feb 18, 2021 · Operations

How Gray Release Enables Safe, Rapid Feature Rollouts in Production

This article explains the concept of gray release, outlines a simple architecture with essential components, describes common routing strategies, and shows how to implement gray releases using Nginx, gateway services, and complex multi‑service scenarios to ensure controlled, low‑risk deployments.

A/B testing · Deployment Strategy · Operations
0 likes · 7 min read
MaGe Linux Operations
Feb 15, 2021 · Operations

5 Essential Practices to Safely Back Up Your Kubernetes Workloads

This article outlines five best‑practice steps—including considering cluster architecture, planning recovery, simplifying operations, ensuring security, and leveraging Kubernetes portability—to help organizations reliably back up applications and data in Kubernetes environments.

Backup · Data Protection · Kubernetes
0 likes · 7 min read
Liangxu Linux
Feb 11, 2021 · Operations

Master Linux Services: Essential systemctl Commands Explained

This guide walks you through using the systemctl tool on modern Linux distributions to start, stop, restart, reload, enable, disable, and query services, manage system power, work with targets, handle remote hosts, and leverage related utilities like journalctl, systemd-analyze, and hostnamectl.

Linux · Operations · Service Management
0 likes · 8 min read
Architects' Tech Alliance
Feb 7, 2021 · Operations

Understanding the Essence and Implementation of Enterprise Digital Transformation

The article explains what digital transformation truly means for enterprises, outlines its three development stages, describes the core connection‑data‑intelligence framework, compares internal capability rebuilding with external ecosystem integration, and offers practical guidance on why and how companies should embark on digital transformation.

Big Data · Digital Transformation · Enterprise
0 likes · 24 min read
Liangxu Linux
Feb 6, 2021 · Operations

How to Make a Bash Script Run Only Once: Lock Files and flock Explained

This guide shows how to prevent a Bash script from being executed multiple times by detecting existing instances, using lock files with process checks, and employing the flock command for reliable atomic locking, complete with practical code examples and pitfalls to avoid.

Bash · Operations · Singleton
0 likes · 8 min read
NetEase Smart Enterprise Tech+
Feb 4, 2021 · Operations

How NetEase Cloud Communication Builds a Real-Time Service Monitoring Platform

NetEase Cloud Communication’s service monitoring platform leverages data collection, preprocessing, alerting, and visualization pipelines—using HTTP APIs, Kafka, custom scripts, and NTSDB—to provide real-time insights, ensure stability, and support scalable, high‑throughput audio‑video services.

Operations · cloud communication · data pipeline
0 likes · 11 min read
21CTO
Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Operations · SRE · Site Reliability Engineering
0 likes · 13 min read
Efficient Ops
Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Operations · anomaly detection · large-scale systems
0 likes · 4 min read
Open Source Linux
Jan 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Secure, Scalable Ops

This article outlines comprehensive production‑grade Kubernetes best practices—including health probes, RBAC, resource management, network policies, monitoring, autoscaling, image security, and zero‑downtime strategies—to help teams run secure, efficient, and highly available workloads.

Kubernetes · Operations · autoscaling
0 likes · 11 min read
Laravel Tech Community
Jan 27, 2021 · Operations

baulk 2.0 Introduces Experimental untar and unzip Commands with Advanced ZIP Features

The Windows‑only baulk package manager version 2.0 adds experimental untar and unzip sub‑commands, detailing untar's support for various tar formats and unzip's robust baulk::archive::zip implementation that handles many compression methods, filename encoding detection, SIMD‑accelerated decompression, and strict path security.

Operations · Windows · archive
0 likes · 3 min read
Alibaba Cloud Developer
Jan 27, 2021 · Operations

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

This article shares practical insights from a technical leader on designing robust system architecture, implementing comprehensive capacity planning, establishing reliable operations processes, strengthening security, and cultivating team awareness to achieve long‑term stability for large‑scale internet services.

Operations · architecture design · capacity planning
0 likes · 24 min read
Programmer DD
Jan 23, 2021 · Information Security

What Is a Bastion Host and Why It’s Essential for Secure Operations

This article explains the concept, purpose, design principles, core features, authentication methods, deployment options, and popular open‑source and commercial solutions of bastion hosts, highlighting how they centralize access control, audit operations, and improve overall IT security and compliance.

Bastion Host · Operations · access control
0 likes · 9 min read
DevOps Cloud Academy
Jan 21, 2021 · Operations

DevOps: Unifying Development, Operations, and QA

This article defines DevOps, outlines its benefits and drawbacks, and explains key concepts such as automation, CI/CD, multi‑environment deployments, early failure detection, rollback, policy enforcement, and observability, showing how they collectively improve software delivery and organizational collaboration.

DevOps · Operations · automation
0 likes · 14 min read
Zhuanzhuan QA
Jan 19, 2021 · Operations

Comprehensive Full-Link Performance Testing Process and Practices for E-commerce Platforms

This article details a systematic full‑link performance testing workflow—including background, timing, scenario design, data preparation, capacity planning, monitoring, issue analysis, and post‑test cleanup—aimed at reliably evaluating and scaling e‑commerce services during major promotional events.

Operations · Performance Testing · capacity planning
0 likes · 18 min read
FunTester
Jan 18, 2021 · Operations

Stress vs. Load Testing: Fixed Threads vs. Fixed QPS Explained

This article clarifies the distinction between stress testing and load testing, describing their respective models—fixed‑thread and fixed‑QPS—along with key metrics, formulas, and practical benefits for evaluating system performance under varying workloads.

Load Testing · Operations · Performance Testing
0 likes · 5 min read
MaGe Linux Operations
Jan 16, 2021 · Operations

Master Secure File Transfers with scp: A Step‑by‑Step Guide

Learn how to securely copy files between local and remote machines using the scp command, covering basic syntax, authentication options, copying to and from remote hosts, recursive directory transfers, and transferring between two remote servers, all illustrated with practical examples.

Linux · Operations · SSH
0 likes · 5 min read
21CTO
Jan 15, 2021 · Operations

How iQIYI Scaled Its Payment System with Full‑Link Load Testing

This article details iQIYI's end‑to‑end load‑testing methodology for its payment platform, covering problem identification, core‑link mapping, environment setup, realistic traffic modeling, execution safeguards, results from capacity verification and stress testing, and future plans for a unified testing solution.

Load Testing · Operations · capacity planning
0 likes · 12 min read
Liangxu Linux
Jan 14, 2021 · Operations

How to Resolve Stuck Kubernetes Resources, Reset etcd, and Fix API Server Errors

This guide explains how to delete inconsistent Kubernetes rc, deployment, and service objects, reset etcd data, address apiserver start failures caused by missing ServiceAccount certificates, disable SELinux for fluentd logs, generate ServiceAccount keys, recover from etcd startup errors, configure host trust, change hostnames, enable VirtualBox copy‑paste, force‑delete pods and namespaces, and avoid resource‑request‑only containers causing contention.

Cluster · Operations · etcd
0 likes · 17 min read
Liangxu Linux
Jan 12, 2021 · Information Security

What Is a Bastion Host and How Does It Secure Operations?

This article explains the concept, purpose, design principles, functional modules, authentication methods, deployment options, and open‑source implementations of bastion hosts, highlighting how they centralize control, audit, and protect privileged access to servers and network devices.

Authentication · Bastion Host · Deployment
0 likes · 9 min read
Efficient Ops
Jan 12, 2021 · Operations

Master Nginx: From Reverse Proxy Basics to High‑Availability Load Balancing

This article explains Nginx’s core concepts—including reverse proxy, load balancing, static‑dynamic separation, common commands, configuration blocks, and high‑availability setup with Keepalived—providing step‑by‑step guidance and practical examples for building robust web infrastructure.

Nginx · Operations · high availability
0 likes · 11 min read
NetEase Game Operations Platform
Jan 9, 2021 · Operations

Real-Time Log Intelligent Classification Practice

This article describes how NetEase built a real‑time log intelligent classification system using Flink and AI algorithms, detailing the challenges of massive log volumes, the Drain template‑extraction method, algorithm workflow, performance results, and a practical case study that demonstrates reduced alert storms and faster issue diagnosis.

AI · Drain algorithm · Flink
0 likes · 15 min read
Alibaba Terminal Technology
Jan 6, 2021 · Frontend Development

Mastering Gray-Scale Cross-Platform Monitoring for Front-End Apps

This article explains the background, technical architecture, real‑world case, and key takeaways of implementing gray‑scale monitoring across web, Weex, mini‑programs, and other cross‑platform front‑end solutions to improve issue detection and reduce mean time to recovery.

Operations · cross‑platform · frontend
0 likes · 10 min read
Aikesheng Open Source Community
Jan 6, 2021 · Databases

MySQL Governance Practices at Industrial and Commercial Bank of China

This article details ICBC's extensive MySQL deployment—nearly ten thousand nodes supporting core A‑level applications—and outlines the bank's governance framework, including current challenges, standardized operational procedures, automation, containerization, and future self‑healing strategies to ensure reliable, high‑performance database services.

Containerization · Database Governance · Operations
0 likes · 16 min read
Continuous Delivery 2.0
Jan 6, 2021 · Operations

Test In Production (TIP): Microsoft’s Shift‑Right Testing, Fault Injection, and Chaos Engineering Practices

The article explains Microsoft’s Test‑In‑Production (TIP) approach, describing why production is the only true environment, how they use gradual releases, feature flags, telemetry, fault injection, circuit‑breaker testing, and chaos engineering to improve reliability, micro‑service compatibility, and business continuity.

ChaosEngineering · FaultInjection · Microsoft
0 likes · 11 min read
Architects' Tech Alliance
Jan 5, 2021 · Operations

Understanding Data Centers: Architecture, Technologies, and Operational Considerations

This article explains what data centers are, outlines their core components—compute, storage, and networking—covers architectural decisions, industry standards, and emerging technologies such as edge computing, micro‑data centers, cloud integration, SDN, HCI, containers, NVMe, and GPU acceleration, highlighting their impact on modern enterprise operations.

Edge Computing · GPU · HCI
0 likes · 11 min read
Top Architect
Jan 5, 2021 · Operations

Microservice Monitoring Architecture: Five‑Layer Hierarchy and Key Practices

The article explains the importance of microservice monitoring and presents a five‑level monitoring hierarchy—from infrastructure to end‑user experience—along with five essential monitoring aspects and a typical architecture using agents, message queues, ELK, and time‑series databases to ensure reliable, observable services.

Metrics · Microservice · Operations
0 likes · 6 min read
Architecture Digest
Jan 4, 2021 · Operations

Design and Implementation of a Gray Release System

This article explains the concept of gray release, outlines a simple architecture with essential components, describes common strategies such as header, cookie, and parameter based routing, and provides detailed implementation guidance for Nginx, gateway, and complex multi‑service scenarios.

A/B testing · Operations · Service Architecture
0 likes · 7 min read
DevOps Cloud Academy
Jan 2, 2021 · Operations

Understanding DevOps: Benefits, Differences from Traditional Ops, and Practical Implementation

This article explains what DevOps is, outlines its key benefits such as faster feedback loops, higher quality and reduced costs, compares it with traditional IT operations, and provides practical guidance on mindset shifts, infrastructure as code, automation, and tool selection for successful adoption.

Collaboration · DevOps · Infrastructure as Code
0 likes · 10 min read
21CTO
Jan 2, 2021 · Operations

Designing & Operating Highly Available Scalable Systems: Google’s SRE Secrets

This article presents a comprehensive overview of Site Reliability Engineering (SRE) as shared by Google SRE expert Ramón Medrano Llamas, covering SRE fundamentals, a typical day’s workflow, design principles for massive scale, fault‑tolerant architecture, monitoring, SLI/SLO metrics, redundancy strategies, disaster recovery, and operational best practices.

Operations · SRE · Scalable Systems
0 likes · 13 min read
Architect
Jan 2, 2021 · Operations

Layered Architecture of Microservice Monitoring and Key Practices

This article explains the layered architecture of microservice monitoring, detailing five monitoring levels—from infrastructure to end-user experience—along with essential monitoring points such as logs, metrics, tracing, alerts, and health checks, and presents a typical monitoring stack using agents, Kafka, ELK, and InfluxDB.

Metrics · Operations · logging
0 likes · 6 min read
Youzan Coder
Dec 30, 2020 · Operations

ERROR Log Governance and Monitoring Alerting Practice at Youzan

Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.

Alerting · Error Handling · Log Management
0 likes · 9 min read
21CTO
Dec 28, 2020 · Operations

Step-by-Step Guide to Install and Configure Jenkins with Supervisord on Linux

This tutorial walks you through downloading Jenkins, setting up JDK 1.8, installing and configuring Supervisord, creating the Jenkins directory, configuring supervisord to manage Jenkins, retrieving the initial admin password, and completing the web UI setup, all with clear command examples and screenshots.

DevOps · Installation · Jenkins
0 likes · 4 min read
Programmer DD
Dec 26, 2020 · Operations

How Netflix’s Telltale Transforms Application Monitoring for 100+ Services

Netflix built the in‑house Telltale system to consolidate monitoring data, reduce alert fatigue, and provide intelligent, multi‑dimensional health assessments for over a hundred production applications, enabling faster incident resolution and more reliable streaming for its 200 million users.

Netflix · Operations
0 likes · 13 min read
DevOps Cloud Academy
Dec 26, 2020 · Operations

Key DevOps Metrics for Effective Software Delivery

The article outlines essential DevOps metrics—such as deployment frequency, deployment time, automated test pass rate, code commit volume, defect escape rate, cost, failure rates, detection time, unplanned work, MTTF, application performance, MTTD, MTTR, delivery time, change quality, and customer feedback—to help teams monitor and improve software delivery speed, quality, and reliability.

Deployment · DevOps · Metrics
0 likes · 9 min read
Architecture Digest
Dec 24, 2020 · Backend Development

WeChat Architecture: Strategies, Agile Practices, and Large‑Scale System Design

The article details WeChat’s three‑in‑one strategy of precise product, agile projects, and robust technical support, explaining how the team achieves massive scalability, high availability, extensible protocols, resilient disaster recovery, and embedded monitoring through practices like small‑system‑big‑scale, gray‑release, and foundational components.

Backend · Operations · WeChat
0 likes · 17 min read
Architect
Dec 23, 2020 · Operations

Design and Evaluation of Log Collection Agents: Flume vs Filebeat

This article analyzes the shortcomings of traditional log‑collection agents, compares Flume and Filebeat against four criteria (low cost, stability, efficiency, and a lightweight footprint), and presents practical solutions for file discovery, offset tracking, multi‑line handling, and performance tuning in modern logging pipelines.

Agent Design · Flume · Operations
0 likes · 13 min read
Xianyu Technology
Dec 22, 2020 · Operations

Comprehensive Message Traceability and Real-Time Log Processing for Xianyu

Xianyu’s new Message Quality Platform links client, API, and server logs by a unique messageId, cleans and clusters real‑time telemetry, correlates user behavior, and visualizes abnormal nodes, giving end‑to‑end traceability that cuts incident investigation time by over 90% and can be applied to other pipelines.

Message Tracing · Operations · backend reliability
0 likes · 8 min read
Efficient Ops
Dec 20, 2020 · Operations

Boost Server Performance: CPU, Memory, Disk, Network & Concurrency Optimizations

This article summarizes Tao Hui's 2020 GOPS Global Operations Conference talk, covering practical techniques for optimizing basic resources, improving network efficiency, reducing request latency, and scaling system concurrency to achieve higher throughput and lower latency in modern distributed services.

Operations · network efficiency · system resources
0 likes · 23 min read
Java Architect Essentials
Dec 18, 2020 · Operations

An Out‑of‑the‑Box ELK‑Based Log and Metric Collection Solution for Private Deployments

This article presents a ready‑to‑use ELK‑based solution for private‑deployment environments, detailing design principles, rapid one‑click deployment via Jenkins, log and metric collection with Filebeat and Metricbeat, alerting using ElastAlert, and visualization in Kibana, while emphasizing simplicity, robustness, and minimal operational overhead.

ELK · Kibana · Operations
0 likes · 10 min read
Continuous Delivery 2.0
Dec 18, 2020 · Operations

Applying the VALET Model for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET model—a five‑dimensional SLO language covering Volume, Availability, Latency, Error, and Ticket—to unify communication, automate data collection, and improve reliability across its massive retail and e‑commerce infrastructure.

Operations · Reliability · SLO
0 likes · 9 min read
JD Cloud Developers
Dec 16, 2020 · Backend Development

How JD Logistics Scaled a Billion‑Level Async Messaging System for 11.11

JD Logistics architect Chen Haolong detailed the design, scalability strategies, and operational practices behind the billion‑level asynchronous messaging system that powered JD.com’s massive 11.11 shopping festival, revealing how the platform handled unprecedented traffic and ensured reliability.

JD Logistics · Operations · async messaging
0 likes · 2 min read
Top Architect
Dec 16, 2020 · Operations

Six Rules of Thumb for Scaling Software Architectures

The article presents six practical guidelines for designing scalable software architectures, covering cost‑scalability trade‑offs, bottleneck identification, the dangers of slow services, database scaling challenges, the importance of caching, and the role of comprehensive monitoring to ensure reliable growth under heavy load.

Operations · Scalability · caching
0 likes · 15 min read
Youzan Coder
Dec 15, 2020 · Industry Insights

How Youzan Built a Full‑Scale Data Cost Billing System: From SDK to Multi‑Dimensional Analysis

This article details Youzan's end‑to‑end construction of a unified data‑center cost billing system, covering background goals, multi‑type cost support, SDK‑based information collection, cost quantification for offline, real‑time and platform tools, full‑business coverage, multi‑dimensional analysis models, operational rollout, and future plans.

Big Data · Data Platform · Industry Insights
0 likes · 19 min read