Tagged articles
131 articles
58 Tech
Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Alerting · Operations · Scalability
0 likes · 9 min read
dbaplus Community
Jul 7, 2019 · Operations

Turning Online Incidents into Growth: From Firefighting to Real Technical Mastery

The article reflects on handling online incidents by first extinguishing the immediate problem, then digging into root causes, and expands the discussion to what truly constitutes technical ability, the pitfalls of reinventing solutions, raising one’s perspective, and the critical role of systematic retrospection.

Software Engineering · incident management · problem solving
0 likes · 12 min read
DevOps
Jun 25, 2019 · Operations

Applying Emergency Room Principles to IT Operations: Kanban, Scrum, and Prioritization

The article draws parallels between emergency rooms and IT operations, describing how Kanban's WIP limits, one‑to‑one liaison models, transparent dashboards, and Scrum time‑boxing (daily stand‑ups and weekly reviews) help a globally distributed team prioritize urgent incidents while still advancing important non‑urgent work.

DevOps · IT Operations · incident management
0 likes · 10 min read
Efficient Ops
Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

JIRA workflow · Operations · fault handling
0 likes · 10 min read
Efficient Ops
Sep 24, 2018 · Operations

How Checklist Thinking Fuels Ops Professionals' Lifelong Growth

This talk explores how ops engineers can achieve continuous professional development by adopting checklist thinking, covering growth drivers, error classification, practical checklist applications, cognitive models, and design principles that turn complex incidents into systematic, repeatable processes.

DevOps · Growth · Operations
0 likes · 34 min read
dbaplus Community
May 8, 2018 · Operations

How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenges of system availability in modern operations, explains the concept of availability and the N‑nines metric, introduces Business Continuity Management (BCM) and Google SRE approaches, and provides concrete technical and managerial methods, including architecture standardization, scaling strategies, tooling, emergency drills, and centralized incident management, to improve operational reliability.

Availability · BCM · Operations
0 likes · 30 min read
MaGe Linux Operations
May 8, 2018 · Operations

What My First Day in Ops Taught Me About Mistakes and Teamwork

A personal account of switching jobs, the challenges of a first production shift, a critical menu‑click error, the team’s rapid response, and the lasting operational lessons learned about risk awareness, double‑checking, and continuous improvement.

Lessons Learned · career transition · incident management
0 likes · 7 min read
Efficient Ops
Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Operations · SRE · incident management
0 likes · 7 min read
Alibaba Cloud Infrastructure
Dec 27, 2017 · Operations

Efficient Ticket System Operations During Double 11 Promotion

The article describes how a ticketing system with strict SLA enforcement, automated routing, and team‑based service management enabled rapid, orderly issue handling during the high‑volume Double 11 shopping event, achieving near‑90% resolution within 30 minutes and improving overall business stability.

Double 11 · Operations · SLA
0 likes · 7 min read
Architecture Digest
Nov 1, 2017 · Operations

A Structured Approach to Online System Issue Diagnosis and Recovery

This article outlines a systematic methodology for understanding, evaluating, and quickly resolving production system incidents by categorizing system layers, assessing impact, employing Linux diagnostic tools, and designing fault‑tolerant mechanisms to minimize downtime and maintain core functionality.

Backend · Linux tools · Operations
0 likes · 12 min read
dbaplus Community
Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Operations · capacity planning · incident management
0 likes · 14 min read
Efficient Ops
Oct 9, 2017 · Operations

How Tencent Scales Operations for Holiday Traffic Surges

This article explains how Tencent's social platform operations team prepares for massive holiday traffic spikes by following a four‑stage process—business preparation, capacity evaluation, resource provisioning, and scaling with stress testing—while detailing team structures, operational standards, and the supporting tool ecosystem that enable reliable, high‑availability services.

Operations · Tooling · capacity planning
0 likes · 13 min read
ITPUB
May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Operations · SRE · incident management
0 likes · 18 min read
ITPUB
Feb 21, 2017 · Operations

How We Resolved a Sudden DNS Outage That Took Down Our Website and App

When an early Saturday morning outage left the company’s website and mobile app inaccessible for many users, the team traced the issue to an unpaid domain that caused DNS resolution failures. The article details the investigation steps, the temporary fixes, and the lessons learned about DNS processes and operational readiness.

DNS · Outage · incident management
0 likes · 13 min read
Efficient Ops
Dec 28, 2016 · Operations

Transforming Financial Application Operations: Lessons from a European Rollout

This article shares a detailed case study of how a financial services team restructured European application operations, applied lean retrospectives, built a top‑down monitoring system, and introduced systematic stakeholder collaboration to dramatically improve incident response, system robustness, and user satisfaction.

DevOps · Operations · application monitoring
0 likes · 14 min read
Efficient Ops
Dec 19, 2016 · Operations

What 16 Major 2016 Outages Teach Us About Disaster Recovery

This article reviews sixteen notable 2016 service outages across finance, cloud, and entertainment, analyzes their causes—ranging from power failures to DDoS attacks—and highlights the critical need for robust disaster‑recovery and information‑security practices.

Operations · incident management · information security
0 likes · 11 min read
Efficient Ops
Nov 27, 2016 · Operations

When Ops Heroes Burn Out: Tackling Personal Heroism in Operations

The article explores personal heroism in operations, defining it as reliance on individual effort to keep flawed systems appearing normal, examines its short‑term benefits and long‑term drawbacks for companies, teams, and the heroes themselves, and offers practical strategies to eliminate this risky mindset.

Operations · SLA · Team Culture
0 likes · 10 min read
Efficient Ops
Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Operations · SRE · incident management
0 likes · 14 min read
Efficient Ops
Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Operations · automation · capacity planning
0 likes · 21 min read
Efficient Ops
May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents: a network outage caused by PortFast being improperly configured on a trunk link, and an NFS failure that crippled an API service. It then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · NFS · Operations
0 likes · 11 min read
Efficient Ops
Feb 3, 2016 · Operations

Why Human Errors Still Plague Modern Ops and How to Prevent Them

This article examines recent high‑profile internet outages caused by human error, explores why operations teams are especially prone to mistakes despite automation and standards, and offers practical strategies—such as hiring the right people, fostering safety awareness, and turning professionalism into habit—to reduce future incidents.

Operations · automation · best practices
0 likes · 14 min read
Efficient Ops
Sep 13, 2015 · Operations

How Tencent’s BlueKing Platform Automates Ops: Key Takeaways from the Efficient Operations Talk

This article summarizes a detailed Q&A from the Efficient Operations talk, covering BlueKing’s integration with databases, agent resource management, alarm de‑duplication, automation workflows, development language choices, data handling, and the platform’s suitability for various enterprise environments.

BlueKing · Database operations · DevOps
0 likes · 13 min read
Efficient Ops
May 29, 2015 · Operations

Why Ctrip’s Outage Took Hours to Recover – Lessons for Ops Teams

The article examines Ctrip’s prolonged service restoration after a May 28 incident, analyzing the complexities of SOA‑based architectures, the pitfalls of black‑box operations, and how transitioning to white‑box, DevOps‑aligned practices can prevent similar outages.

Configuration Management · DevOps · IT Operations
0 likes · 11 min read