Tag

incident response


Efficient Ops
Jun 9, 2025 · Operations

How OnCall Platforms Transform Incident Management and Reduce Manual Overhead

This article explains the purpose and key features of OnCall platforms, compares popular solutions like PagerDuty, Opsgenie, Grafana OnCall and Alibaba Cloud ARMS, clarifies webhooks with a simple analogy, and summarizes how centralized on‑call management boosts operational efficiency while minimizing manual intervention.

Oncall · Operations · incident response
0 likes · 5 min read
Efficient Ops
May 20, 2025 · Information Security

How an Overseas Hacker Group Disrupted a Guangzhou Tech Company's Services

A coordinated overseas cyber‑attack breached a Guangzhou tech firm's self‑service equipment backend, causing hours of service outage, data leakage, and significant losses, prompting swift police investigation, evidence preservation, and a detailed technical analysis of the attackers' methods.

China · cybersecurity · hacker group
0 likes · 4 min read
360 Zhihui Cloud Developer
Feb 27, 2025 · Operations

How 360’s Unified Alert Service Boosts System Reliability and Cuts MTTR

This article explains the importance, pain points, architecture, core capabilities, and future roadmap of the 360 Zhihui Cloud "Yunzhou" unified alert service, showing how it improves observability, reduces alert noise, and accelerates incident response for modern cloud‑native systems.

Operations · alerting · cloud-native
0 likes · 14 min read
Efficient Ops
Feb 20, 2025 · Information Security

How a Maintenance Staff Leak Exposed Security Gaps and How to Prevent It

A recent case where a maintenance worker exploited device‑management flaws to steal confidential files for foreign spies highlights the need for heightened vigilance, strict self‑discipline, and prompt reporting, offering practical steps to safeguard against similar security breaches.

data leakage · incident response · information security
0 likes · 4 min read
DataFunSummit
Feb 13, 2025 · Information Security

Building and Optimizing a Comprehensive Security System: Practices, Innovations, and Future Outlook

This article presents a detailed walkthrough of constructing a robust security architecture, covering single‑person security team strategies, risk perception and quantification, rapid incident response, automated detection, precise strike mechanisms, deterrence tactics, and forward‑looking plans for intelligent, data‑driven risk management.

Security · Security Architecture · automation
0 likes · 21 min read
DevOps Operations Practice
Dec 8, 2024 · Information Security

Incident Report: Investigating and Removing a Server Malware Causing 100% CPU Usage

This article documents a step‑by‑step investigation of a compromised Linux server that exhibited 100% CPU usage, detailing the process, network, and startup‑service analysis, the discovery of cryptomining malware, and the complete removal procedure.

CPU · Linux · Network
0 likes · 5 min read
Efficient Ops
Nov 12, 2024 · Operations

How to Build Robust Online Stability: Practices, Metrics, and Team Strategies

This article outlines a comprehensive approach to online stability, covering preventive measures, service governance, capacity planning, incident detection, multi‑dimensional monitoring, alerting, R&D efficiency improvements, team building, and practical guidelines for simplifying, standardizing, automating, and scaling stability initiatives across an organization.

Operations · automation · incident response
0 likes · 15 min read
Java Architect Essentials
Oct 7, 2024 · Information Security

Insider Ransomware Attack by a Former Engineer: Case Study and Security Lessons

A disgruntled former infrastructure engineer at a U.S. industrial firm deleted backups, locked out administrators, and demanded $750,000 in Bitcoin, leading to his arrest and highlighting the severe risks, legal consequences, and mitigation strategies associated with insider ransomware threats.

IT governance · incident response · information security
0 likes · 10 min read
Efficient Ops
Aug 20, 2024 · Information Security

Building a One‑Person PCI DSS Image‑Signing Service and Surviving a P0 Outage

This article recounts how a solo developer built a Django‑based Docker image signing service to meet PCI DSS requirements, faced two severe incidents—including a 17.5‑hour P0 outage caused by concurrency limits and a misconfigured Rekor service—and shares the operational lessons learned for reliable SRE practice.

Django · PCI DSS · SRE
0 likes · 9 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

Backend Development · Deployment · System Stability
0 likes · 26 min read
Efficient Ops
May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

Capacity Planning · Operations · SRE
0 likes · 29 min read
Wukong Talks Architecture
Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers, affecting several dependent services for 87 minutes, and the detailed root‑cause analysis and subsequent improvement actions are presented to enhance system resilience and change management.

API · Operations · Tencent Cloud
0 likes · 8 min read
Zhuanzhuan Tech
Feb 21, 2024 · Operations

Network Operations Incident Report: BGP Routing Failure and Resolution

This report details a network operations incident where a BGP routing change caused an EBGP neighbor to go idle, outlines the step‑by‑step troubleshooting, analysis of the root cause, and the implemented solution involving a new L3 node and redundant EBGP peers.

BGP · Cloud Networking · incident response
0 likes · 8 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

Operations · emergency planning · fault handling
0 likes · 14 min read
Architect
Nov 17, 2023 · Information Security

A Real-World Incident of Accidental Public Snapshot Sharing and Lessons Learned

The author recounts a 2018 incident where a cloud disk snapshot was unintentionally made public, exposing customer data, and shares a detailed reflection on the operational mistakes, risk management failures, and recommended safeguards for high‑risk cloud operations.

Cloud Computing · data security · incident response
0 likes · 9 min read
JD Retail Technology
Nov 13, 2023 · Information Security

Red‑Blue Adversarial Testing for a Big Data Platform: Process, Benefits, and Best Practices

This article outlines the red‑blue adversarial testing process for a big‑data platform during the Double‑Eleven promotion, detailing its purpose, benefits, step‑by‑step execution, common issues, and recommendations to improve system reliability and security.

Big Data · Chaos Engineering · incident response
0 likes · 12 min read
Architecture and Beyond
Nov 12, 2023 · Frontend Development

Designing a Yellow Banner System for User Notification During Service Outages

The article explains how a configurable yellow banner system on web interfaces can promptly inform users about service disruptions, guide their actions, increase transparency, and improve the user experience. It also outlines implementation considerations such as configurability, persistence, and independent deployment.

frontend · incident response · notification
0 likes · 6 min read
JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · SRE
0 likes · 26 min read
Bilibili Tech
Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

DevOps · SRE · alert management
0 likes · 19 min read
Didi Tech
Aug 31, 2023 · Big Data

Data Stability Construction and Fault Governance Practices at Didi Customer Service

Didi’s multi‑year data‑stability program for its customer‑service platform progressed through fault‑centered engineering, business‑aligned cross‑team work, and capability normalization. It instituted pre‑, mid‑, and post‑fault safeguards, clear ownership, automated alerts, and repair tools, which cut the fault count by 42% and improved mean‑time‑to‑repair by more than a factor of two, while boosting team communication and satisfaction.

Data Reliability · Data Stability · Data Warehouse
0 likes · 16 min read