Tagged articles

Operations

3329 articles · Page 24 of 34

Dec 23, 2019 · Operations

2019 IDCF DevOps Case Study Series: Insights from Facebook, Microsoft, Etsy and More

In 2019, IDCF organized three in‑depth DevOps case study events where participants analyzed the practices of leading companies such as Facebook, Microsoft, Etsy and others, highlighting hacker culture, automated Windows builds, and a five‑step tech transformation roadmap.

Case Study, Etsy, Facebook

0 likes · 6 min read

Efficient Ops

Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

AIOps, Alarm Management, Operations

0 likes · 15 min read

Ctrip Technology

Dec 19, 2019 · Cloud Native

Evolution of Ctrip Cloud Platform: From OpenStack IaaS to Cloud‑Native Kubernetes

This article chronicles Ctrip's cloud‑infrastructure journey—from the early OpenStack‑based IaaS platform through containerization with Mesos, the migration to large‑scale Kubernetes clusters, and the adoption of cloud‑native practices that improve resource utilization, deployment speed, and application governance.

Kubernetes, Operations, cloud-native

0 likes · 10 min read

Efficient Ops

Dec 18, 2019 · Operations

How CITIC Bank Pioneered Organizational‑Level Agile and DevOps Practices

CITIC Bank’s DevOps lead Li Hongtao explains how the bank’s new organizational‑level agile practice integrates development and data‑center operations, employs a DevOps capability maturity model, and cultivates agile coaches to overcome transformation challenges, offering a practical roadmap for peers in the banking sector.

Agile, Operations, digital transformation

0 likes · 5 min read

Qunar Tech Salon

Dec 17, 2019 · Operations

Evolution of Call Center Technology: From Hotlines to Multimedia

This article traces the evolution of call center technology across four generations—from early hotlines using PSTN and PBX, through IVR and CTI innovations, to modern multimedia channels—highlighting key concepts, features, and their impact on operational efficiency and customer service.

CTI, IVR, Operations

0 likes · 10 min read

Java Architect Essentials

Dec 15, 2019 · Backend Development

Designing Ultra‑High‑Performance Flash‑Sale Systems: Architecture, Consistency, and Availability

This article analyzes the core challenges of building flash‑sale (秒杀) systems—high‑concurrency reads and writes, strict consistency, and ultra‑high availability—and presents a layered architectural approach covering dynamic/static separation, hotspot optimization, database tuning, and comprehensive high‑availability strategies.

Flash Sale, High Availability, Operations

0 likes · 28 min read

360 Quality & Efficiency

Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification, Operations, Service Port

0 likes · 8 min read

DevOpsClub

Dec 12, 2019 · Operations

Avoid Common Pitfalls in R&D Efficiency Metrics: Principles, Cases, and Implementation

This article examines common mistakes in measuring R&D efficiency, outlines seven guiding principles, presents case studies from leading tech companies, and shares practical steps and lessons for implementing a comprehensive R&D metrics system.

Operations, R&D efficiency, devops

0 likes · 3 min read

Aikesheng Open Source Community

Dec 12, 2019 · Databases

Financial Industry MySQL High‑Availability Practices – Interview with Ming Xiyuan

In this interview, Ming Xiyuan, Technical Solutions Director at Aikesheng, shares practical guidance on selecting and deploying high‑availability MySQL solutions for the financial sector, covering planning pitfalls, hardware considerations, key configuration parameters, and the company’s product roadmap.

Cloud, High Availability, MySQL

0 likes · 5 min read

MaGe Linux Operations

Dec 10, 2019 · Operations

Master JMeter Distributed Load Testing: Setup, Assertions, and Performance Analysis

This guide walks operations engineers through understanding server bottlenecks, setting up JMeter's distributed load testing environment on Windows and Linux, configuring assertions and variables, analyzing performance results, and monitoring both concurrency and stability tests to ensure reliable scalability.

JMeter, Operations, Variables

0 likes · 13 min read

Architects Research Society

Dec 9, 2019 · Operations

Overview of StackStorm: An Open‑Source Automation Platform

StackStorm is an open‑source automation platform that integrates existing infrastructure and applications, enabling event‑driven workflows, troubleshooting, auto‑remediation, and continuous deployment through modular components such as sensors, triggers, actions, rules, workflows, and packs, all managed via a web UI, CLI, and REST API.

Automation, Operations, devops

0 likes · 7 min read

NetEase Game Operations Platform

Dec 7, 2019 · Operations

Intelligent Anomaly Detection for Operations Maintenance: Machine Learning Methods and Workflow

This article explains the importance of operations maintenance, outlines the challenges of traditional rule‑based anomaly detection, and describes how machine‑learning‑driven AIOps—including feature engineering, unsupervised and supervised models—can provide more accurate, scalable, and automated detection of server anomalies.

AIOps, Operations, feature engineering

0 likes · 10 min read

MaGe Linux Operations

Dec 5, 2019 · Operations

When Alipay Crashed: Lessons on High Availability and Disaster Recovery

On December 5, Alipay suffered a brief outage that sent users into a panic. The article recounts the incident humorously, with tales of personal losses and meme images, and closes with a reminder of the critical importance of high‑availability architecture and disaster‑recovery planning for large‑scale financial services.

Alipay outage, Disaster Recovery, Operations

0 likes · 3 min read

21CTO

Dec 3, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains why many system alerts are poorly designed, describes the true purpose of alerts as actionable notifications, distinguishes business rule monitoring from reliability monitoring, and presents practical metrics, strategies, and simple anomaly‑detection algorithms to create high‑quality, actionable alerts for reliable operations.

Alerting, Anomaly Detection, Operations

0 likes · 23 min read

Youku Technology

Dec 2, 2019 · Operations

Technical Architecture and Operational Practices of Alibaba's 2019 Double‑11 "Cat Evening" Live Show

Alibaba’s 2019 Double‑11 “Cat Evening” live show combined a unified codebase across Youku, Taobao and Tmall with synchronized clocks, latency‑measurement devices and SEI‑injected messages to guarantee fair, zero‑loss interactions, while employing dynamic routing, pre‑warming, peak‑shaving, downstream protection and rehearsed contingency plans to handle massive concurrency and ensure stable, high‑quality user experience.

Alibaba, Operations, fairness

0 likes · 11 min read

Architecture Digest

Nov 29, 2019 · Operations

Evolution and Optimization of JD.com’s Order Center Elasticsearch Cluster Architecture

This article details how JD.com’s order center migrated its Elasticsearch cluster from a basic, mixed‑cloud deployment to a real‑time dual‑cluster backup solution, covering each architectural stage, scaling decisions, data‑sync strategies, and the performance pitfalls encountered along the way.

Cluster Architecture, JD.com, Operations

0 likes · 14 min read

Alibaba Cloud Developer

Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations, SRE, Site Reliability Engineering

0 likes · 27 min read

Youku Technology

Nov 26, 2019 · Operations

Resource Assurance Strategies and Practices for Alibaba Youku Double‑11 Promotion

The article outlines Alibaba Youku’s end‑to‑end resource‑assurance platform for Double‑11 promotions, detailing automated demand collection, business‑to‑technical metric conversion, single‑machine capacity testing, rapid scaling and emergency borrowing, which together cut manual reviews by 80% and boosted delivery efficiency tenfold.

Automation, Operations, Resource Management

0 likes · 13 min read

Architect's Tech Stack

Nov 23, 2019 · Databases

Redis Usage Guidelines and Operational Restrictions

This article provides comprehensive best‑practice guidelines for using Redis, covering data classification, key naming, size and connection limits, cache TTL, recommended client‑hash sharding, and a strict list of prohibited commands and operations to ensure performance, reliability, and maintainability.

Cache, Operations, Performance

0 likes · 9 min read

Programmer DD

Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux, Operations, Performance

0 likes · 11 min read

JavaEdge

Nov 22, 2019 · Operations

How to Install and Configure Elasticsearch and Kibana on macOS with Homebrew

This step‑by‑step guide shows how to install the JDK, set up Elasticsearch via Homebrew, verify its operation, optionally adjust the cluster name, install Kibana, and access the Kibana web UI for managing Elasticsearch data.

Elasticsearch, Homebrew, Installation

0 likes · 2 min read

Ctrip Technology

Nov 21, 2019 · Cloud Native

Case Study: Intermittent Container Timeout Issues – Analysis and Resolution

This article presents a detailed case study of intermittent container timeout problems in a Kubernetes environment, examining kernel upgrades, NUMA configurations, CPU affinity bindings, kubelet behavior, cAdvisor overhead, and hardware faults, and outlines the investigative steps and solutions applied.

CPU affinity, Hardware Fault, Kubernetes

0 likes · 8 min read

MaGe Linux Operations

Nov 20, 2019 · Operations

Essential Shell Script Best Practices for Reliable Automation

This article outlines the evolution from manual to automated operations, then presents a comprehensive set of shell‑script guidelines—including header conventions, formatting, safety checks, variable handling, loop pitfalls, logging, concurrency locks, and risk‑avoidance techniques—to help engineers write robust, maintainable automation scripts.

Best Practices, Operations, Shell Scripting

0 likes · 10 min read

Alibaba Cloud Developer

Nov 18, 2019 · Operations

How Alipay’s Tech Team Turned ‘Impossible’ Double‑11 Peaks into Seamless Transactions

Over eleven years, Alipay’s engineers turned the daunting Double‑11 traffic surges from a source of chaotic outages into a routine load on a smooth, scalable system through relentless capacity planning, architectural overhauls, rigorous stress testing, and the adoption of the self‑developed OceanBase database, making once “impossible” goals an everyday reality.

Alipay, Databases, Double11

0 likes · 23 min read

Architecture Digest

Nov 16, 2019 · Operations

What Happens If Alipay’s Data Centers Are Physically Destroyed? A Deep Dive into Redundancy and Disaster Recovery

The article examines how Alipay’s financial data would survive a physical destruction of its servers by explaining multi‑site data center architectures, hot and cold backups, power redundancy, fire‑suppression systems, and the role of partner banks in data recovery, highlighting the extensive resilience measures in modern financial infrastructures.

Alipay, Data Center, Disaster Recovery

0 likes · 8 min read

DevOps Cloud Academy

Nov 14, 2019 · Operations

Speeding Up Jenkins Plugin Downloads in China with the TUNA Mirror and Update‑Center Fixes

This article explains why Jenkins plugin downloads are slow for Chinese users, introduces the TUNA mirror service, details the signed update‑center.json mechanism, and provides a step‑by‑step solution—including configuration changes and open‑source tools—to achieve fast, reliable plugin installation.

CI/CD, China, Jenkins

0 likes · 5 min read

Ctrip Technology

Nov 14, 2019 · Operations

Chaos Engineering: Principles, Practices, and Lessons from Ctrip

The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.

Fault Injection, Operations, Reliability

0 likes · 9 min read

21CTO

Nov 12, 2019 · Backend Development

Why API Gateways Are Essential for Secure, Efficient Microservices

Implementing an API gateway in a microservice architecture isolates internal services from external clients, secures the backend, reduces operational cost, streamlines change processes, and decouples clients from services, while enabling features like service circuit breaking, gray releases, and online testing to simplify development and improve reliability.

Operations, Service Architecture, microservices

0 likes · 3 min read

Python Programming Learning Circle

Nov 10, 2019 · Operations

What Happens If Alipay’s Servers Are Bombed? Inside Data Center Redundancy

The article explains how financial platforms like Alipay protect user funds through multi‑site data centers, hot and cold backups, power redundancy, fire‑suppression systems, and strict location standards, showing why destroying a single server would not erase all stored money.

Data Center, Disaster Recovery, Operations

0 likes · 9 min read

Qunhe Technology Quality Tech

Nov 9, 2019 · Operations

How We Cut BIM Drawing Failures from 0.01% to 0.0005% with Automated Monitoring

The BIM construction‑drawing team built an automated monitoring and validation tool using Spring Boot, REST‑Assured and JIRA APIs, turning a tedious manual bug‑fix workflow into a streamlined process that reduced online drawing‑failure rates from 0.01% to virtually zero.

Automation, BIM, Jira

0 likes · 5 min read

JD Retail Technology

Nov 8, 2019 · Operations

Smart Supply Chain Operations for JD.com 11.11 Promotion: Integrated Planning, AI‑Driven Forecasting, and Real‑Time Optimization

JD.com's Smart Supply Chain Y Business Management team collaborated across divisions to implement AI‑driven demand forecasting, automated replenishment, micro‑service architecture, and real‑time monitoring, enabling precise inventory control, cost reduction, and seamless 11.11 promotion fulfillment through integrated planning, pricing, and fulfillment innovations.

AI, Operations, demand forecasting

0 likes · 21 min read

Liangxu Linux

Nov 7, 2019 · Operations

Monitor Linux Processes with a Simple Shell Script

This guide shows how to create a reusable shell function that retrieves a process ID for a given user and program, demonstrates its usage, and explains each command involved so you can reliably detect when a service stops running.

Automation, Operations, Script

0 likes · 5 min read

JD Retail Technology

Nov 7, 2019 · Operations

7FRESH Technical Preparation for the 11.11 Shopping Festival: System Scaling, Degradation Strategies, Emergency Plans, Performance Testing, and Operational Monitoring

The article details how 7FRESH's R&D, testing, network operations, and product teams coordinated system capacity expansion, degradation mechanisms, emergency response procedures, extensive performance testing, and 24/7 monitoring to ensure stable and scalable service during the high‑traffic 11.11 shopping event.

Operations, capacity planning, degradation

0 likes · 10 min read

JD Retail Technology

Nov 6, 2019 · Artificial Intelligence

Technical Overview of JD.com Search and Recommendation Systems for the 11.11 Shopping Festival

The article details JD.com's internally developed distributed search engine and recommendation platform, their new architectures, deep‑learning‑driven ranking and recall models, component‑based deployment, extensive performance testing, and coordinated operations that powered the massive 11.11 shopping event.

Operations, Search Engine, deep learning

0 likes · 5 min read

DevOps Cloud Academy

Nov 6, 2019 · Operations

Using Jenkins Pipeline and Groovy to Simplify CI/CD for Multi‑Service Environments

Jenkins pipelines, defined as Groovy code, enable version‑controlled, reusable CI/CD jobs for complex multi‑service environments, simplifying repetitive tasks such as URL updates, providing change tracking, and automating builds, tests, and releases across branches via Jenkinsfile.

CI/CD, Groovy, Jenkins

0 likes · 3 min read

Architects Research Society

Nov 5, 2019 · Backend Development

Principled GraphQL: Ten Principles for Building, Maintaining, and Operating Data Graphs

This article presents ten GraphQL principles—grouped into integrity, agility, and operations—that guide the design, evolution, and secure large‑scale deployment of a unified data‑graph layer, emphasizing a single schema, collaborative implementation, schema registries, performance monitoring, and structured logging.

API design, Data Graph, GraphQL

0 likes · 17 min read

360 Zhihui Cloud Developer

Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

AIOps, Big Data, Machine Learning

0 likes · 15 min read

Java Architecture Diary

Nov 5, 2019 · Databases

Seamless Redis Migration: RDB/AOF, redis-dump, RedisShake & Docker Strategies

This guide outlines step‑by‑step methods for migrating a production Redis cluster—including using Redis’s native RDB/AOF backups, JSON‑based redis‑dump, the Alibaba‑maintained RedisShake tool, and Docker‑based cluster setups—ensuring data integrity and minimal downtime during the transition.

Docker, Operations, Redis

0 likes · 4 min read

Programmer DD

Nov 5, 2019 · Operations

Why GitLab Banned Chinese and Russian Users: Inside the New Country‑Lock Policy

GitLab announced a "work country/region lock" that bars Chinese and Russian citizens from receiving offers and restricts employees with customer‑data access from relocating to those countries, sparking heated debate over open‑source non‑discrimination and geopolitical pressures.

GitLab, Operations, country lock

0 likes · 8 min read

Efficient Ops

Nov 5, 2019 · Operations

From Waterfall to AIOps: How One Ops Leader Transformed Zhejiang Mobile’s IT

In an in‑depth interview, Fang Wei, former assistant general manager of Zhejiang Mobile’s network department, shares his 15‑year journey from B‑domain maintenance to leading DevOps, cloud, and AIOps initiatives, detailing the shift from waterfall processes to agile, micro‑services, containerization, and AI‑driven operations that reshaped the company’s IT landscape.

AIOps, Agile, Operations

0 likes · 15 min read

58 Tech

Nov 4, 2019 · Operations

Intelligent Operations Practices: Multi‑Dimensional Anomaly Detection, Alarm Merging, Knowledge‑Graph Construction, and Root‑Cause Analysis

This article summarizes the keynote on intelligent operations presented at the 13th GOPS Global Operations Conference, covering multi‑dimensional anomaly detection, smart alarm aggregation, the construction of an operations knowledge graph, and AI‑driven root‑cause analysis techniques for large‑scale server environments.

Anomaly Detection, Operations, Root Cause Analysis

0 likes · 9 min read

Efficient Ops

Nov 3, 2019 · Operations

How Zhejiang Mobile Is Pioneering AIOps to Reach NoOps

Zhejiang Mobile’s IT department chronicles its journey from a 2015 cloud‑native initiative to a cutting‑edge AIOps transformation, detailing a six‑level NoOps roadmap, digital fault‑governance, middle‑platform consolidation, organizational agility, and measurable operational gains that position it as a telecom industry leader.

AIOps, Artificial Intelligence, Big Data

0 likes · 7 min read

JD Retail Technology

Oct 31, 2019 · Operations

Collaborative Load Testing for JD.com 11.11 Event: Organizational Changes, Scale Expansion, and ForceBot Traffic Recording & Replay

The article details JD.com's coordinated effort to prepare for the 11.11 shopping festival by expanding load‑testing scale, improving cross‑team collaboration, and enhancing the ForceBot platform with traffic recording and replay capabilities to achieve more realistic and efficient full‑chain performance evaluations.

JD.com, Operations, forcebot

0 likes · 7 min read

dbaplus Community

Oct 28, 2019 · Operations

Avoid Common Prometheus Pitfalls: Best Practices for Reliable Monitoring

This article shares practical Prometheus best‑practice tips, covering the accuracy‑reliability trade‑off, self‑monitoring setups, avoiding NFS storage, pruning high‑cardinality metrics, handling rate‑function traps, alert‑graph mismatches, group_interval effects, and the overarching goal of stable, cost‑effective observability.

Alerting, Best Practices, Operations

0 likes · 9 min read

21CTO

Oct 28, 2019 · Operations

What Keeps Aviation IT Safe? Lessons from System Design and Data‑Driven Ops

The article reflects on the challenges of modernizing aviation IT systems, highlighting safety‑first regulations, the lack of plug‑in architecture, the need for robust load‑balancing and fault‑tolerance, and how data‑driven automation can bridge the gap between strict oversight and efficient operations.

Data-Driven, Operations, aviation

0 likes · 13 min read

Sohu Tech Products

Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management, On-Call, Operations

0 likes · 15 min read

Tencent Cloud Developer

Oct 23, 2019 · Fundamentals

Building Programmer Soft Skills: Insights from a 2100-Day Technical Sharing Journey

Senior database engineer Yang Jianrong recounts his 2,100‑day daily technical‑sharing journey, emphasizing programmer soft‑skills—steady learning, clear communication, and open mindset—while offering four practical tips on planning, avoiding technical silos, using fragmented time, and engaging communities, and outlining future focus on AIOps, modern languages, and advanced database technologies.

Operations, career-development, database-engineer

0 likes · 11 min read

JD Retail Technology

Oct 22, 2019 · Industry Insights

How JD.com Prepares Its Systems for 11.11: Stress Tests, ForceBot Evolution, and Quality Controls

JD.com's Retail Technology and Data Platform orchestrated a full‑chain, four‑entry‑point stress test for the 11.11 shopping festival, introduced an upgraded ForceBot traffic‑recording tool, and implemented a "Quality Month" with ten safeguards to ensure system stability and prevent incidents during the massive sales event.

Operations, devops, e‑commerce

0 likes · 7 min read

DevOps Cloud Academy

Oct 19, 2019 · Operations

Resolving Common SonarQube Platform Issues: Data Instability, Rule Configuration, and Project Authorization

This article explains how to address three common SonarQube challenges—data instability across branches, difficulty assigning quality profiles, and project permission management—by creating per‑branch projects, using Jenkins pipeline scripts with Sonar REST APIs, and applying permission templates to streamline large‑scale code‑quality scanning.

Automation, CI/CD, Jenkins

0 likes · 7 min read

Alibaba Cloud Native

Oct 16, 2019 · Cloud Native

Master the Distributed Systems Knowledge Map: From SOA to MSA and Beyond

This comprehensive guide walks you through the fundamentals, design patterns, consistency models, core components, and engineering practices of modern distributed systems, helping you understand micro‑service architecture, network protocols, data management, fault tolerance, and performance optimization in cloud‑native environments.

Operations, architecture, cloud-native

0 likes · 32 min read

Java Captain

Oct 12, 2019 · Operations

Curated List of Free Technical Books Covering Linux, System Administration, Networking, and More

This article presents a curated collection of over a hundred free technical books—including Linux command‑line guides, system‑administration manuals, computer‑networking textbooks, and Docker tutorials—complete with brief descriptions, download links, and the impressive GitHub star and fork statistics of the source project.

Linux, Operations, free books

0 likes · 8 min read

Efficient Ops

Oct 9, 2019 · Operations

From IT Maintenance to IT Operations: Why the Shift Matters

This article explores the nuanced differences between IT maintenance (IT运维) and IT operations (IT运营), explaining how organizations transition from merely keeping systems alive to delivering high‑quality, business‑centric services that satisfy users, executives, and IT staff alike.

Automation, IT Operations, Operations

0 likes · 19 min read

Architects' Tech Alliance

Oct 9, 2019 · Operations

Understanding Linux Virtual Server (LVS) Load Balancing: Principles, Implementation Methods, and Scheduling Algorithms

This article explains the role of load balancers in large-scale internet applications, introduces Linux Virtual Server (LVS) as a Layer‑4 (transport‑layer) software load‑balancing solution, describes its architecture and its NAT/TUN/DR forwarding methods, and details various static and dynamic scheduling algorithms such as Round Robin, Weighted Least‑Connection, and locality‑based strategies.

LVS, Linux, Network

0 likes · 11 min read

Efficient Ops

Oct 8, 2019 · Operations

Build a Docker Container Monitoring Stack with cAdvisor, InfluxDB, Grafana

To effectively monitor Dockerized services, this guide walks through selecting a monitoring solution, deploying cAdvisor, integrating it with InfluxDB for persistent storage, visualizing metrics via Grafana, and addressing common issues such as missing utilities, memory stats, and network traffic inaccuracies.

Grafana, InfluxDB, Operations

0 likes · 15 min read

DevOps Cloud Academy

Oct 7, 2019 · Operations

GitLab High Availability Solution with DRBD

This guide details a step‑by‑step setup of a highly available GitLab service using two virtual machines, DRBD for block‑level replication, configuration of GitLab and PostgreSQL directories, DRBD resource creation, service start‑up, and manual primary‑secondary failover procedures.

DRBD · GitLab · High Availability

0 likes · 8 min read

MaGe Linux Operations

Sep 28, 2019 · Operations

Master IT Monitoring: Functions, Types, Layers & Top Tools (Zabbix vs Prometheus)

This article explains the essential functions of IT monitoring systems, classifies them into log, trace, and metric types, describes a five‑layer monitoring architecture, and compares two popular open‑source solutions—Zabbix and Prometheus—helping practitioners choose the right tool for their environment.

IT monitoring · Observability · Operations

0 likes · 17 min read

37 Interactive Technology Team

Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Operations · Python · cron

0 likes · 8 min read

Liangxu Linux

Sep 25, 2019 · Operations

Understanding Linux Load Average: What the Numbers Really Mean

This article explains what Linux load average measures, how to read the 1‑, 5‑, and 15‑minute values, what they indicate on single‑core and multi‑core systems, and which thresholds should raise alerts for system performance monitoring.

Linux · Operations · Performance

0 likes · 6 min read

转转QA

Sep 25, 2019 · Operations

Comprehensive Testing Strategies for Advertising Recall Systems

The article outlines a complete testing framework for advertising recall services, analyzing three demand types, defining testing focus for each, and presenting tools for log comparison, recall result verification, result comparison, and batch regression to ensure high‑quality ad delivery and revenue stability.

Advertising · Operations · backend

0 likes · 8 min read

Efficient Ops

Sep 23, 2019 · Operations

How to Build an Effective CMDB for Scalable Operations Management

This article explains the step‑by‑step process of constructing a configuration management database (CMDB) for operations, covering resource modeling, data integration, organizational structures, maintenance methods, and how a well‑designed CMDB supports higher‑level business operations such as automation, visualization, and capacity planning.

Automation · CMDB · ITIL

0 likes · 14 min read

Architects Research Society

Sep 23, 2019 · Operations

Curated List of Open‑Source Workflow Engines and BPM Tools

This article presents a comprehensive, categorized list of open‑source workflow engines and BPM tools—including Airflow, Argo, Cadence, Camunda, and many others—detailing their primary features and typical use cases for orchestration, data pipelines, and micro‑service coordination.

Automation · Engine · Operations

0 likes · 4 min read

Efficient Ops

Sep 22, 2019 · Operations

How Experts Refined the DevOps Technical Operations Assessment Method

The September 2019 workshop convened over 30 DevOps specialists from leading Chinese enterprises to review and improve the evaluation method for the DevOps Capability Maturity Model Part 4: Technical Operations, resulting in a more complete and standardized assessment framework.

Assessment Method · Operations · Standardization

0 likes · 3 min read

Programmer DD

Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Alerting · Operations · PromQL

0 likes · 9 min read

Efficient Ops

Sep 18, 2019 · Databases

Why the DBA Role Is Becoming a Narrowed, High‑Risk Career Path

The article analyzes how the DBA job market is shrinking as traditional enterprises shift away from legacy systems, cloud adoption reshapes responsibilities, and DBAs face limited advancement unless they transition to architecture or data‑analytics roles, highlighting the growing risk and low reward of staying in pure DBA work.

Big Data · DBA · Database Administration

0 likes · 7 min read

Efficient Ops

Sep 15, 2019 · Operations

Why Ops Needs a Project‑Management Mindset: Lessons from a Simple RAID Change

The article shares practical Ops insights, using a simple RAID change incident to illustrate why operations teams must understand change background, choose optimal timing, act as project managers, and follow a structured change process to protect production environments.

Change Management · Operations · incident response

0 likes · 8 min read

DevOps Cloud Academy

Sep 14, 2019 · Operations

Step-by-Step Installation and Configuration of Elasticsearch, Logstash, and Kibana on CentOS 7

This guide details the prerequisites, RPM-based installation, configuration files, and startup procedures for Elasticsearch, Logstash, and Kibana on a CentOS 7 virtual machine, including verification steps to confirm each component is running correctly.

CentOS · Elastic Stack · Elasticsearch

0 likes · 4 min read

Big Data Technology Architecture

Sep 9, 2019 · Operations

Investigation and Resolution of Partial Queue Consumption after RocketMQ Topic Expansion

This article details a real‑world RocketMQ case where expanding a topic's queue count caused two consumer groups to miss messages on one broker, explains the root cause of missing subscription metadata after cluster scaling, and outlines the manual steps taken to restore full consumption.

Consumer Lag · Message Queue · Operations

0 likes · 8 min read

DevOps Cloud Academy

Sep 8, 2019 · Operations

SSO and WebHook Integration Guide for GitLab and Jenkins

This guide details step‑by‑step configurations for integrating Single Sign‑On (SSO) and WebHook between GitLab and Jenkins, covering GitLab application setup, Jenkins backup and proxy adjustments, plugin installation, token generation, and testing the connection to ensure successful builds.

CI/CD · GitLab · Jenkins

0 likes · 2 min read

DevOps Cloud Academy

Sep 8, 2019 · Operations

Jenkins User, Credential, and Permission Management Guide

This guide explains how to configure Jenkins user management, credential storage, and permission settings, covering entry points, LDAP/GitLab integration, credential types, and role-based access control with detailed steps and visual illustrations for administrators.

Jenkins · Operations · credentials

0 likes · 4 min read

DevOps Cloud Academy

Sep 8, 2019 · Operations

Project Management Guidelines and Jenkins Pipeline Setup

This guide outlines project naming conventions and step‑by‑step instructions for creating a new Jenkins project, configuring build history, parameterized builds, triggers, Jenkinsfile, and how to build, view logs, and debug the pipeline, illustrated with screenshots.

CI/CD · Jenkins · Naming Convention

0 likes · 2 min read

Efficient Ops

Sep 8, 2019 · Operations

Inside GNSEC: How DevOps Leaders Accelerate Cloud and Digital Transformation

The one‑day GNSEC Global New‑Generation Software Engineering Summit gathered senior experts from major banks, tech giants and research institutes to showcase DevOps, cloud‑native, and digital‑transformation practices through a series of insightful talks, live demos, and award ceremonies, highlighting concrete case studies and emerging standards.

Operations · conference · digital transformation

0 likes · 9 min read

DevOps Cloud Academy

Sep 7, 2019 · Operations

Step-by-Step Guide to Installing an OpenShift 3.11 Cluster on CentOS VMs

This guide details the preparation, configuration, and deployment steps for setting up an OpenShift 3.11 cluster on three CentOS 7.6 virtual machines, covering host mapping, SSH key setup, OS updates, image loading, Ansible playbooks, and troubleshooting common issues.

Ansible · CentOS · Kubernetes

0 likes · 10 min read

DevOps Cloud Academy

Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus, an open‑source monitoring and alerting toolkit originally developed by SoundCloud and now a CNCF project, offers multidimensional data models, flexible queries, pull‑based data collection, various metric types (counter, gauge, summary, histogram), local and remote storage, service discovery, and integrates with Grafana for visualization.

Observability · Operations · Prometheus

0 likes · 8 min read

JD Retail Technology

Sep 2, 2019 · Operations

How a Real‑Time H5 Monitoring Platform Solves E‑Commerce Activity Issues

Faced with frequent user complaints about broken, slow, or misleading H5 activity pages, JD's e‑commerce operations team classifies the issues into four types and uses the Woodpecker platform, a scalable real‑time monitoring and analysis system, to detect configuration errors, server faults, development bugs, and minor UX flaws before users report them. The platform also offers extensible, configurable alerts and historical scans.

AI · H5 monitoring · Operations

0 likes · 15 min read

DevOps Coach

Aug 29, 2019 · Operations

Benchmark Your DevOps Performance with the 2019 Accelerate Report

This article walks you through the key findings of the 2019 Accelerate State of DevOps report, explains the four golden metrics, shows how to use Google’s minimalist benchmark tool to compare your organization against industry baselines, and discusses the emerging service‑operations efficiency metric.

Accelerate Report · Benchmarking · Operations

0 likes · 11 min read

Efficient Ops

Aug 28, 2019 · Operations

How to Harden Linux Server Security: Account, Login, and Boot Controls

This guide details practical Linux server hardening techniques—including account cleanup, password policies, su/sudo restrictions, login controls, and BIOS/GRUB protection—while providing exact command examples for operations teams to quickly improve system security.

Account Management · Linux · Operations

0 likes · 12 min read

Efficient Ops

Aug 26, 2019 · Operations

Why Are CLOSE_WAIT Sockets Sticking? Uncovering HttpClient’s Hidden Connection Leak

This article investigates persistent CLOSE_WAIT sockets in a Tomcat‑Nginx architecture, identifies HttpClient’s connection‑manager as the root cause, and details the step‑by‑step analysis and configuration changes that finally eliminated the issue.

CLOSE_WAIT · ConnectionLeak · HttpClient

0 likes · 7 min read

DevOps

Aug 24, 2019 · Operations

DevOps Engineers: The Highest‑Paid IT Role, Their Value, and How to Build a Career

The article explains why DevOps and SRE engineers top the 2019 Stack Overflow IT job popularity list, outlines their responsibilities, career prospects, and required skills, and provides practical advice for aspiring professionals.

Automation · Cloud · IT careers

0 likes · 8 min read

Cloud Native Technology Community

Aug 21, 2019 · Industry Insights

What Does a DevOps Consultant Actually Do? A Real‑World Walkthrough

This article shares a DevOps consultant’s personal journey, detailing the diverse responsibilities, tools, and mindset required—from early full‑stack experience and virtualization research to CI/CD pipelines, infrastructure‑as‑code, security, load balancing, and fostering a DevOps culture across teams.

Automation · CI/CD · Consulting

0 likes · 9 min read

Youzan Coder

Aug 21, 2019 · Operations

How Opsflow Revolutionized Youzan's DevOps Workflow Management

This article examines the evolution of Youzan's Opsflow workflow engine, detailing its architecture, components, and how it solved numerous operational challenges such as low customizability, lack of progress visibility, and fragmented approval processes, while outlining its current status and future roadmap.

Automation · Finite State Machine · Operations

0 likes · 13 min read

MaGe Linux Operations

Aug 20, 2019 · Operations

Master Linux Shutdown: Commands, History, and Best Practices

This article explains how to safely shut down or reboot Linux systems from the command line, covering the main commands, their systemd origins, detailed usage options, scheduling tricks, and how to cancel pending shutdowns.

Operations · Shutdown · systemd

0 likes · 6 min read

dbaplus Community

Aug 12, 2019 · Operations

Why DevOps Matters and How to Implement It: Practical Lessons from Vipshop

This article explains the need for DevOps, contrasts it with ITIL, outlines practical steps for implementation, and shares Vipshop’s component‑centric DevOps practice, including configuration platforms, risk‑matrix control, and continuous improvement metrics, offering engineers actionable insights for real‑world deployment.

Case Study · Continuous Integration · ITIL

0 likes · 12 min read

DevOps Cloud Academy

Aug 12, 2019 · Operations

Ansible Installation and Basic Usage Guide

This guide walks through setting up a two‑node Linux environment, installing Ansible, configuring its inventory and SSH keys, and demonstrates common Ansible commands for managing hosts, checking connectivity, and executing remote tasks.

Ansible · Automation · Linux

0 likes · 5 min read

DevOps Cloud Academy

Aug 12, 2019 · Operations

Using Ansible Playbooks to Automate MySQL Installation and Cron Job Creation

This tutorial explains how to write Ansible Playbooks in YAML to automate complex tasks such as batch installing MySQL on remote servers and creating scheduled cron jobs, including code examples, parameter explanations, execution commands, and result verification.

Ansible · Automation · MySQL

0 likes · 4 min read

Python Crawling & Data Mining

Aug 10, 2019 · Operations

How to Fix ‘Physical Memory Insufficient’ Error When Starting a VMware Virtual Machine

This guide explains why VMware may report ‘physical memory insufficient’ when launching a VM and provides a step‑by‑step method to reduce the allocated memory, verify the settings, and successfully start the virtual machine.

Operations · Troubleshooting · VMware

0 likes · 4 min read

Efficient Ops

Aug 8, 2019 · Operations

10 Ops Murphy’s Laws Every Engineer Should Read Daily

This article shares a set of operational Murphy’s laws, practical process‑management tips, and automation strategies to help ops engineers reduce human error, improve safety, stability, efficiency, and cost‑saving in daily work.

Automation · Operations · incident response

0 likes · 9 min read

58 Tech

Aug 7, 2019 · Operations

An Overview of the USP Deployment System: Architecture, Models, and Key Features

This article presents a detailed overview of the 58 Deployment System (USP), covering its evolution, Java‑based architecture, communication and deployment models, traffic management, one‑stop and parallel deployments, gray‑scale rollout, fast rollback, task‑driven workflow, and future direction within private‑cloud environments.

Automation · Continuous Integration · Deployment

0 likes · 8 min read

DevOps Engineer

Aug 6, 2019 · Operations

Managing Large‑Scale Jenkins CI/CD Pipelines with Centralized Libraries and the Remote File Plugin

The article explains how Jenkins' multi‑branch pipelines can be extended for enterprise‑scale CI/CD by using dynamic pipeline creation, centralized shared libraries, governance practices, and the Remote File Plugin to centralize, secure, and simplify pipeline script management across many projects.

CI/CD · Jenkins · Operations

0 likes · 6 min read

ITPUB

Aug 5, 2019 · Operations

Mastering SSH Public‑Key Login for Batch Server Operations

This guide explains how SSH public‑key authentication works, walks through generating key pairs, shows the connection handshake, and demonstrates practical batch command execution and file collection across multiple Linux servers using ssh, scp, and nc.

Linux · Operations · Public Key Authentication

0 likes · 9 min read

ITPUB

Aug 5, 2019 · Operations

How a Midnight Migration Saved Millions: Lessons in Problem‑Solving for Developers

A senior engineer recounts a high‑pressure, overnight data‑migration from an overloaded legacy platform to a new micro‑service system, detailing the technical challenges, rapid troubleshooting, multithreaded workarounds, and the broader lessons on what truly makes a programmer great.

Operations · backend · multithreading

0 likes · 16 min read

Efficient Ops

Aug 4, 2019 · Operations

How Capital One Migrated Its Docker Registry to Artifactory: A Practical Operations Case Study

This article details Capital One's migration from an open‑source Docker registry to JFrog Artifactory, covering the evaluation of Artifactory, ECR, and Harbor, the migration process, performance testing, and the resulting production rollout that now serves over 9 million images to 10,000+ developers.

Artifactory · Operations · Registry Migration

0 likes · 8 min read

Java Captain

Aug 3, 2019 · Operations

Practical Guide to Viewing Logs, Processes, Ports, and System Status on Linux

This article provides a comprehensive, step‑by‑step tutorial on using Linux command‑line tools such as cat, tail, vim, grep, sed, ps, netstat, lsof, and free to efficiently view large log files, locate specific entries, monitor processes and ports, and assess overall system health.

Linux · Operations · log management

0 likes · 8 min read

iQIYI Technical Product Team

Aug 2, 2019 · Operations

iQIYI CDN IPv6 Deployment Architecture and Implementation

iQIYI’s CDN scheduling system was redesigned for dual‑stack IPv4/IPv6, adding Anycast DNS, IPv6‑aware probes, and hybrid CDN integration, while upgrading data‑center, backbone, and server configurations through automated SDN and management platforms, enabling over 100 million IPv6 users and gigabit‑scale traffic.

CDN · IPv6 · Operations

0 likes · 18 min read

Ops Development Stories

Jul 29, 2019 · Operations

Mastering Nginx Reverse Proxy, Load Balancing, and Caching

This article explains how to configure Nginx as a reverse proxy, implement load‑balancing strategies, separate static and dynamic content, set up proxy caching with various directives, purge caches, and enable gzip compression, providing complete code examples and practical testing results.

Caching · Nginx · Operations

0 likes · 17 min read

DevOps

Jul 29, 2019 · Operations

Google’s Continuous Delivery Practices and SRE Culture: A DevOps Case Study

This article examines Google’s corporate values, development history, culture, and detailed DevOps and Site Reliability Engineering practices—including continuous delivery, SRE responsibilities, and Google Cloud Platform CI/CD tools—to illustrate how the company achieves 24/7 reliable service deployment at massive scale.

Google · Operations · SRE

0 likes · 15 min read

DevOps

Jul 26, 2019 · Operations

Amazon’s DevOps Journey: From Customer Obsession to Continuous Delivery

This article examines Amazon’s evolution—from its early focus on books and relentless customer obsession to the adoption of micro‑service architecture, two‑pizza teams, and a high‑velocity continuous delivery pipeline—illustrating how strategic cultural and technical choices drive massive operational efficiency.

Amazon · Customer Obsession · Operations

0 likes · 9 min read

Efficient Ops

Jul 25, 2019 · Operations

How Tencent’s Ops Teams Move Massive Workloads to the Cloud and Boost Efficiency

Tencent’s recent Operations Open Day showcased how its engineers migrated billions of users to public cloud, leveraged cloud‑native DevOps, serverless functions, and intelligent data‑center management to dramatically improve efficiency, scalability, and reliability across its massive infrastructure.

Operations · Serverless · cloud-native

0 likes · 9 min read

DevOps

Jul 25, 2019 · Operations

Why DevOps Teams Often Turn Into Tool Chains and What an Ideal DevOps Team Structure Looks Like

The article analyzes why many DevOps teams devolve into tool‑chain or pipeline roles, examines executor and organizational factors, presents a six‑role DevOps team model linked to the Six Thinking Hats, shares community viewpoints on role prioritization, and concludes that DevOps structures must be tailored to solve concrete business problems rather than follow a fixed standard.

Operations · Team Structure · Tool Chain

0 likes · 12 min read

58 Tech

Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Alerting · Incident Management · Operations

0 likes · 9 min read

Xianyu Technology

Jul 23, 2019 · Operations

Automated Service Fault Localization System Architecture

The automated service fault localization system ingests massive real‑time instrumentation data, builds call‑chain graphs, and instantly pinpoints the exact component causing timeouts or other errors, achieving developer‑level accuracy within seconds instead of minutes while remaining simple, fast, and fully automated.

Big Data · Fault Localization · Operations

0 likes · 8 min read
