Tagged articles

incident response

230 articles · Page 1 of 3

Jun 24, 2026 · Operations

Will AI Replace Ops Engineers by 2025? From Automated Troubleshooting to One‑Click Deployments

The article examines how AI is reshaping operations—from instant fault detection and 47‑second incident resolution to natural‑language deployment scripts, predictive capacity planning, continuous security monitoring, and automated knowledge bases—while arguing that engineers will transition from fire‑fighters to system designers.

AIOps · Automation · capacity planning

0 likes · 15 min read

Raymond Ops

Jun 23, 2026 · Information Security

Linux Intrusion Detection and Incident Response: A Practical Guide to Security Event Investigation

This guide walks through building a layered intrusion detection system on Linux, comparing HIDS tools such as AIDE, rkhunter, and auditd, detailing installation, configuration, baseline management, automated response scripts, forensic data collection, monitoring, and best‑practice hardening for effective security event investigation and remediation.

AIDE · Intrusion Detection · Linux

0 likes · 48 min read

Black & White Path

Jun 18, 2026 · Information Security

Inside the AI‑Powered Hack: Full Claude & Codex Attack Log Exposed

OALABS recovered over 1,000 Claude and Codex session logs from a compromised server, revealing how the attackers duplicated AI agents, used them for reconnaissance, vulnerability exploitation, data theft, and even attempted cryptocurrency cracking across at least 14 companies, demonstrating that AI agents can dramatically lower the technical barrier for sophisticated cyber‑attacks.

AI security · Claude · Codex

0 likes · 49 min read

MaGe Linux Operations

Jun 11, 2026 · Information Security

Redis Mining Attack: Full Incident Response Timeline from Alert to Hardening

This article provides a step‑by‑step engineering‑level walkthrough of a real Redis mining breach, covering everything from the initial alert, evidence collection, and process termination to crontab cleanup, SSH key removal, system hardening, monitoring setup, and post‑mortem analysis.

Linux · Monitoring · Redis

0 likes · 51 min read

IT Services Circle

May 24, 2026 · Information Security

Fired, He Deleted 96 Government Databases in Minutes and Asked AI How to Clear Logs

Just five minutes after being terminated, twin brothers with prior fraud convictions used SQL commands to drop 96 U.S. government databases, queried AI on log‑clearing techniques, and exposed critical failures in the company's off‑boarding process, leading to a high‑profile federal investigation and legal fallout.

AI · SQL · database breach

0 likes · 9 min read

Architecture & Thinking

May 20, 2026 · Operations

Six‑Step Emergency Plan to Detect, Recover, and Eliminate Message Backlog

In distributed systems, message‑queue backlogs can cripple core services; this article breaks down a six‑step emergency workflow—from alert detection and throttling to temporary scaling, root‑cause analysis, targeted fixes, and final validation—plus long‑term architectural and monitoring strategies, illustrated with real‑world cases and Java code samples.

Backlog · Java · Message Queue

0 likes · 21 min read

Black & White Path

May 15, 2026 · Information Security

Twin Brothers Delete 96 Government Databases – A Privileged‑Account Failure Case Study

In 2025, twin brothers with prior cyber‑crime convictions exploited a privileged‑account gap at a federal‑service contractor, erased 96 government databases within six minutes, used AI to seek log‑clearing methods, and triggered a multi‑layered forensic and legal response that highlights critical gaps in identity‑access management, backup integrity, and insider‑threat detection.

AI-assisted attack · MITRE ATT&CK · Security Monitoring

0 likes · 13 min read

Black & White Path

May 11, 2026 · Information Security

State‑Sponsored Actors Gain Root on Palo Alto PAN‑OS via Captive Portal Buffer Overflow

A detailed analysis of CVE‑2026‑0300 reveals how a nation‑backed group exploited a buffer‑overflow in PAN‑OS's Captive Portal to obtain root on Palo Alto firewalls, outlining the attack chain, affected versions, immediate mitigations, long‑term remediation, compliance impacts, and lessons learned.

CVE-2026-0300 · Captive Portal · PAN-OS

0 likes · 12 min read

Ops Community

May 4, 2026 · Information Security

Investigating and Securing a Server After a Suspicious Login

When a production server shows unexpected high CPU usage and unknown login activity, this guide walks Linux ops engineers through confirming intrusion, stopping the attacker, tracing the attack path, removing backdoors, restoring system integrity, and applying hardening measures to prevent future breaches.

Forensics · Linux · Rootkit Detection

0 likes · 27 min read

DevOps Coach

Apr 27, 2026 · Operations

How a 2 AM Kubernetes Change Cost $47,000: My Nightmare Incident and 7 Lessons

A mis‑timed production resource change triggered a cascading Kubernetes failure that cost $47,000, and the author details the incident timeline, mistakes made, and seven concrete operational safeguards introduced to prevent similar outages.

circuit breaking · incident response · kubernetes

0 likes · 12 min read

FunTester

Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Automation · Operations · Platform Engineering

0 likes · 8 min read

Linyb Geek Road

Apr 25, 2026 · Information Security

How to Build Enterprise System Stability and Ensure Security?

The article outlines practical expert guidance for improving enterprise system reliability and security, covering architecture reviews, risk matrices, change management, continuous monitoring, incident response plans, one‑click escape mechanisms, security perimeter defenses, detection, leakage prevention, compliance, and ongoing security operations.

Defensive Programming · Monitoring · Risk Management

0 likes · 11 min read

Software Engineering 3.0 Era

Apr 21, 2026 · Operations

Can AI Be Blamed for a 9‑Hour Travel App Outage? Lessons on Software Engineering Discipline

A nine‑hour outage of the popular travel app exposed how reliance on AI‑generated code can mask deeper failures in disaster‑recovery planning, incident response, and engineering rigor, reminding developers that high availability depends on disciplined practices rather than tools.

AI code · High Availability · incident response

0 likes · 5 min read

Raymond Ops

Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on‑call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step‑by‑step response flow, details war‑room roles, escalation paths, handoff checklists, post‑mortem procedures, and dozens of ready‑to‑use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations

0 likes · 27 min read

Black & White Path

Apr 17, 2026 · Information Security

Threat Alert: Cloud‑Native Cybercrime Group TeamPCP Targets Docker, Kubernetes, and Redis

TeamPCP, a newly identified cloud‑native threat group, has compromised at least 60,000 servers worldwide by exploiting exposed Docker APIs, Kubernetes clusters, Redis instances, and the React2Shell vulnerability, employing automated tools such as proxy.sh, kube.py, and react.py, with detailed MITRE ATT&CK mapping and concrete defense recommendations.

Docker · MITRE ATT&CK · Malware Analysis

0 likes · 16 min read

dbaplus Community

Apr 14, 2026 · Information Security

How to Investigate and Respond to Kubernetes Cluster Intrusions

This guide walks through practical techniques for detecting, tracing, and remediating Kubernetes cluster compromises, covering pod‑level debugging, node inspection, audit‑log analysis, and common attacker behaviors such as privileged pod creation and hostPath mounting.

Cluster Forensics · Pod Debugging · audit logs

0 likes · 7 min read

Black & White Path

Apr 11, 2026 · Information Security

A Beginner’s Struggle: Securing a Compromised ThinkPHP Site Over Several Days

The author recounts a multi‑day incident response to a ThinkPHP website that was compromised via a weak admin password, detailing how repeated data tampering, hidden scheduled‑task scripts, and a ransom message were investigated, mitigated, and finally contained through systematic hardening and monitoring.

Malware · PHP · ThinkPHP

0 likes · 7 min read

Alibaba Cloud Native

Apr 10, 2026 · Cloud Native

How HiClaw Automates Crash Alert Analysis with AI Agents in a Cloud‑Native Environment

This article details the design and workflow of HiClaw, an AI‑driven, cloud‑native system that intercepts DingTalk crash alerts, isolates analysis in secure containers, and automatically generates actionable reports, dramatically reducing manual investigation time while complying with strict internal security policies.

AI · Automation · incident response

0 likes · 15 min read

Black & White Path

Apr 7, 2026 · Information Security

Ransomware ‘Shaming’ Attacks Surge: Over 2,000 Companies Exposed in 2026

Ransomware groups are increasingly using double‑extortion "shaming" tactics, publicly leaking stolen data to pressure victims, with Breachsense reporting more than 2,000 compromised firms in 2026, a 40% rise projected for the year, prompting new defensive strategies across industries.

Threat Intelligence · cybersecurity · data breach

0 likes · 10 min read

ITPUB

Mar 30, 2026 · Information Security

Essential Network Security FAQ: 100+ Key Concepts Explained

This comprehensive guide defines network security, outlines its core attributes, enumerates common threats and attack types, and provides practical mitigation strategies, covering everything from encryption basics and access controls to advanced topics like zero‑day vulnerabilities, zero‑trust architecture, and security automation.

Access Control · Encryption · Threats

0 likes · 44 min read

ITPUB

Mar 23, 2026 · Information Security

Essential Network Security Q&A: From Fundamentals to Advanced Threats

This comprehensive guide answers 100 common network security questions, covering basic concepts, core properties, threat sources, attack types, encryption methods, access controls, incident response, and emerging technologies such as zero‑trust, quantum encryption, and SOAR.

Access Control · Encryption · Threats

0 likes · 44 min read

Black & White Path

Mar 12, 2026 · Information Security

When 1 Billion IDs Leak: Inside the Biggest Identity Verification Breach Ever

A leading identity verification provider exposed over one billion personal records after a cloud storage bucket was misconfigured, revealing names, IDs, biometric data and more; the breach impacted finance, e‑commerce, government and social platforms, prompting analysis of technical and managerial failures and a set of remediation steps for individuals, enterprises and the industry.

KYC security · Privacy · cloud misconfiguration

0 likes · 10 min read

MaGe Linux Operations

Mar 4, 2026 · Information Security

Master Linux Intrusion Detection & Incident Response: A Practical Hands‑On Guide

This comprehensive guide walks you through building a layered Linux intrusion detection system, configuring host‑based tools such as AIDE, rkhunter, and auditd, automating security audits, performing forensic investigations, and executing a six‑step incident response workflow to detect, contain, and remediate attacks effectively.

AIDE · Forensics · HIDS

0 likes · 59 min read

Raymond Ops

Feb 25, 2026 · Operations

How to Stop 3 AM Alert Wake‑Ups: 5 Smart Monitoring Techniques

Every night engineers are jolted awake by noisy alerts, but by applying five practical techniques—including alert severity tiers, aggregation, dynamic thresholds, intelligent routing, and data‑driven effectiveness analysis—teams can cut daily alerts from over a hundred to fewer than ten and dramatically improve response times.

Alerting · Alertmanager · Monitoring

0 likes · 44 min read

Ops Community

Feb 12, 2026 · Operations

Why Did Our Nginx Hit Connection Limits? A Deep Dive into Misdiagnosis and Rate‑Limiting Redesign

This postmortem explains how a Nginx connection‑saturation incident was initially misidentified as traffic surge, details the metrics and command‑line checks that revealed a connection‑lifecycle failure, and describes the step‑by‑step redesign of rate‑limiting, budgeting, monitoring, and run‑book procedures that restored stability.

Monitoring · NGINX · connection limits

0 likes · 32 min read

Xiao Liu Lab

Feb 12, 2026 · Information Security

When fail2ban Became a Monero Miner: Detection, Removal, and Prevention

A temporary test server on Tianyi Cloud was compromised by a malicious XMRig miner masquerading as fail2ban, causing CPU usage to skyrocket; the article details how the intrusion was discovered, the forensic steps taken, and a comprehensive remediation and hardening guide to prevent similar attacks.

CPU Spike · Linux security · fail2ban

0 likes · 9 min read

Raymond Ops

Jan 20, 2026 · Information Security

How to Build a Complete Linux Enterprise Security Framework—from Intrusion Detection to Incident Response

This guide walks through a real-world DDoS and SSH brute‑force incident and shows how to design a layered Linux security architecture, configure firewalls, host hardening, OSSEC HIDS, Suricata IDS, ELK monitoring, automated response scripts, and continuous improvement metrics for enterprise environments.

Automation · IDS · Linux

0 likes · 15 min read

Ray's Galactic Tech

Jan 15, 2026 · Operations

Ultimate Production Incident Response Handbook: Quick Commands, Root Cause Analysis, and Preventive Architecture

This comprehensive guide presents a unified framework for diagnosing and resolving production incidents—covering CPU spikes, OOM, disk exhaustion, log overload, port failures, container crashes, Kubernetes pod issues, SSH attacks, I/O bottlenecks, MySQL connection limits, Redis memory saturation, message‑queue backlogs, deployment failures, certificate expirations, file‑handle exhaustion, time drift, mining malware, and DDoS—by providing rapid‑check commands, immediate remediation steps, root‑cause classification, and architectural safeguards.

Linux · Operations · Production

0 likes · 11 min read

Raymond Ops

Jan 15, 2026 · Information Security

Master Linux Server Intrusion Detection & Response: A Complete Practical Guide

This guide walks Linux administrators through a full‑cycle intrusion detection and emergency response process, covering metric monitoring, log analysis, file integrity checks, attack confirmation, staged remediation, preventive hardening, and useful automation scripts to keep servers secure.

Automation · Intrusion Detection · Linux

0 likes · 16 min read

Ops Community

Jan 4, 2026 · Operations

How a Missed Domain Renewal Crashed Our Site for 2 Hours – Full DNS Outage Postmortem

At 3:07 AM on August 15, 2025, a critical alert reported that the entire site was inaccessible. The 2‑hour outage, which affected 500k users, was caused by an expired domain that entered serverHold status. This postmortem details the detection, root‑cause analysis, emergency recovery steps, and long‑term remediation measures.

DNS · Outage · domain renewal

0 likes · 19 min read

Raymond Ops

Dec 26, 2025 · Information Security

How to Respond When Your Server Is Compromised: Essential Incident Response and Forensics for Ops

This guide walks operations engineers through recognizing intrusion indicators, executing rapid detection scripts, following a structured 24‑hour response workflow, performing comprehensive digital forensics, and applying cleanup and hardening measures to secure compromised servers and prevent future attacks.

Digital Forensics · System Hardening · incident response

0 likes · 15 min read

Ops Community

Dec 21, 2025 · Information Security

How to Investigate and Harden a Compromised Linux Server: Real-World Case Study

This guide walks through a real incident where a Linux server was hijacked by a mining virus, detailing step‑by‑step emergency response, systematic forensic investigation, cleanup procedures, and hardening measures to prevent future breaches, complete with scripts and best‑practice recommendations.

Intrusion Detection · Linux · Monitoring

0 likes · 26 min read

Efficient Ops

Dec 14, 2025 · Information Security

Detect and Respond to Linux Server Intrusions with Log Analysis

This guide walks you through using Linux log tools such as last, lastb, grep, and sshd_config to identify suspicious logins, trace malicious IPs, and apply immediate remediation steps for compromised servers, targeting ops engineers and developers.

Forensics · Linux · incident response

0 likes · 8 min read

MaGe Linux Operations

Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations

0 likes · 26 min read

Architect's Guide

Nov 20, 2025 · Operations

What Caused Cloudflare’s Half‑Internet Outage? A Deep Dive into the Technical Failure

Cloudflare suffered a massive multi‑hour outage that knocked popular sites and AI services offline. The article traces it to a sudden traffic spike, a mis‑configured Rust‑based bot‑management module, and a database permission change that doubled the size of a feature file, overwhelming its routing software.

CDN · ClickHouse · Cloudflare

0 likes · 12 min read

Bilibili Tech

Nov 7, 2025 · Information Security

How AI-Driven Automation Transforms Security Alert Operations and Incident Tracing

This article explores the evolution of security alert automation from manual verification to SOAR and AI-driven solutions, detailing MCP-based AI agents, integration with various security tools, practical case studies of honey‑pot, HIDS, and EDR alert tracing, and the resulting efficiency gains and future outlook.

AI · Alert Analysis · MCP

0 likes · 16 min read

Liangxu Linux

Oct 26, 2025 · Information Security

Master Linux Server Intrusion Detection & Rapid Incident Response: A Complete Hands‑On Guide

This comprehensive guide walks Linux administrators through early detection of system anomalies, detailed log analysis, file‑integrity checks, intrusion confirmation, step‑by‑step emergency response, system hardening, preventive monitoring, and essential open‑source security tools, all illustrated with ready‑to‑run Bash scripts.

Intrusion Detection · Linux · Monitoring

0 likes · 17 min read

MaGe Linux Operations

Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche

0 likes · 20 min read

Rare Earth Juejin Tech Community

Oct 13, 2025 · Backend Development

How Arthas Saved a Double‑11 Sale: Debugging a Thread‑Pool Nightmare

During a Double‑11 promotion, widespread request timeouts caused order success rates to plunge. Using Arthas, jstack, and targeted code analysis, the team traced the problem to a non‑thread‑safe HashBiMap in a global cache, stopped the outage, and implemented fixes to prevent a recurrence.

HashBiMap · Java · arthas

0 likes · 15 min read

Open Source Linux

Oct 9, 2025 · Information Security

Essential Incident Response & Forensics Guide for Server Intrusions

This article provides a comprehensive step‑by‑step process for detecting server compromises, collecting system, memory, and network evidence, analyzing logs, isolating the affected host, removing malicious artifacts, and hardening the environment to prevent future attacks.

Forensics · Monitoring · Script

0 likes · 15 min read

Ops Community

Sep 24, 2025 · Operations

How Ops Engineers Can Stop Online Outages in Minutes: A Proven Emergency Playbook

This article explains why a solid incident‑response plan is critical, describes typical failure scenarios, and introduces the 3‑5‑10 rule for rapid diagnosis and mitigation. It provides ready‑to‑run scripts for system checks, traffic throttling, and service rollback, and shows how automation, AIOps, and chaos engineering can turn reactive firefighting into proactive resilience.

AIOps · Monitoring · emergency plan

0 likes · 18 min read

Ops Community

Sep 18, 2025 · Information Security

Essential Linux Security: Common Vulnerabilities and Practical Defense Strategies

This guide walks you through the most critical Linux security flaws—from privilege‑escalation and misconfigured sudo to SSH, web server, kernel, and container risks—offering concrete hardening steps, logging practices, firewall rules, incident‑response procedures, and compliance tips to build a resilient production environment.

Linux security · Log Monitoring · SSH Hardening

0 likes · 16 min read

Rare Earth Juejin Tech Community

Sep 11, 2025 · Backend Development

How a Single Looped Serialization Turned a Major Promotion into a System Avalanche

A 2021 midnight promotion in Hangzhou crashed when a poorly placed loop serialized a massive object twenty times per request, overwhelming CPU, thread pools, and the Tair cache, leading to a full‑stack service avalanche that was only resolved after a half‑hour emergency rollback.

Caching · Monitoring · incident response

0 likes · 10 min read

dbaplus Community

Sep 3, 2025 · Operations

How to Build System Stability: Definitions, Challenges, and Practical Steps

This article explains what system stability means, why it matters, the difficulties of building it, and provides a detailed, step‑by‑step framework—including risk formulas, resource planning, monitoring, and emergency response—to help backend teams improve reliability and reduce business impact.

Monitoring · Risk Management · incident response

0 likes · 23 min read

MaGe Linux Operations

Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

Automation · Monitoring · SRE

0 likes · 19 min read

MaGe Linux Operations

Aug 10, 2025 · Operations

How to Resolve 100% CPU Outages in Under 3 Minutes: A Real‑World Emergency Guide

This article walks through a real‑world 100% CPU incident on an e‑commerce platform, showing how to detect the problem within seconds, analyze Java threads, apply quick emergency fixes, implement permanent refactoring, and set up long‑term monitoring to prevent future outages.

CPU · Java · Operations

0 likes · 14 min read

Liangxu Linux

Aug 9, 2025 · Information Security

How a Single Weak Password Sank a 158‑Year‑Old UK Logistics Firm

A 158‑year‑old British transport company was crippled by a ransomware attack after hackers guessed an employee's weak password, leading to full data encryption, massive financial loss, bankruptcy, and highlighting systemic IT security failures.

Akira group · Cyberattack · IT security

0 likes · 9 min read

MaGe Linux Operations

Jul 28, 2025 · Information Security

How to Detect and Respond to Server Intrusions: A Complete 24‑Hour Incident Response Guide

This guide walks operations and security engineers through recognizing intrusion signs, executing a step‑by‑step 24‑hour response, collecting forensic evidence, cleaning and hardening the system, and building proactive monitoring to protect servers from future attacks.

Automation · Forensics · Linux

0 likes · 16 min read

Efficient Ops

Jul 8, 2025 · Information Security

How the SafePay Ransomware Crippled Ingram Micro’s Global Operations

On July 4, 2025, Ingram Micro, the world’s largest IT distributor, suffered a crippling ransomware attack by the SafePay group that stole nearly 1 TB of confidential data, encrypted critical systems, and forced a 48‑hour outage, highlighting severe risks for global supply‑chain operations.

Cyberattack · Ingram Micro · SafePay

0 likes · 3 min read

dbaplus Community

Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, driving team silence‑up, measuring progress, and the tools that make the process sustainable.

Alert Management · Monitoring · backend operations

0 likes · 42 min read

Cognitive Technology Team

Jun 17, 2025 · Cloud Computing

What a Single NullPointerException Taught Us About Cloud Reliability

The June 2025 Google Cloud outage, caused by an untested code change that triggered a NullPointerException, crippled over 70 core services worldwide, prompting a rapid technical fix, public apology, and industry‑wide reflections on cloud stability, fault tolerance, and deployment practices.

Google Cloud · NullPointerException · cloud outage

0 likes · 7 min read

Zuoyebang Tech Team

Jun 12, 2025 · Information Security

How AI‑Powered RAG and Agents Are Revolutionizing Enterprise Security Operations

This article explains how the rise of AI large‑model technology and Retrieval‑Augmented Generation (RAG) combined with autonomous AI agents enable a three‑layer network‑boundary defense, address deep operational challenges such as alert overload and response latency, and dramatically improve incident‑response efficiency in large‑scale enterprises.

AI Agents · AI security · RAG

0 likes · 16 min read

Efficient Ops

Jun 9, 2025 · Operations

How OnCall Platforms Transform Incident Management and Reduce Manual Overhead

This article explains the purpose and key features of OnCall platforms, compares popular solutions like PagerDuty, Opsgenie, Grafana OnCall and Alibaba Cloud ARMS, clarifies webhooks with a simple analogy, and summarizes how centralized on‑call management boosts operational efficiency while minimizing manual intervention.

Oncall · incident response · webhook

0 likes · 5 min read

Efficient Ops

May 20, 2025 · Information Security

How an Overseas Hacker Group Disrupted a Guangzhou Tech Company's Services

A coordinated overseas cyber‑attack breached a Guangzhou tech firm's self‑service equipment backend, causing hours of service outage, data leakage, and significant losses, prompting swift police investigation, evidence preservation, and a detailed technical analysis of the attackers' methods.

China · cybersecurity · hacker group

0 likes · 4 min read

ITPUB

May 3, 2025 · Information Security

20 Critical Server Operations You Must Never Do – Real Cases & Fixes

Based on analysis of over 500 enterprise server failure cases, this guide lists 20 absolutely prohibited server actions across six dimensions, each illustrated with a real incident and practical technical measures to prevent recurrence.

devops · incident response · security best practices

0 likes · 14 min read

dbaplus Community

Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · Monitoring · SRE

0 likes · 20 min read

360 Zhihui Cloud Developer

Feb 27, 2025 · Operations

How 360’s Unified Alert Service Boosts System Reliability and Cuts MTTR

This article explains the importance, pain points, architecture, core capabilities, and future roadmap of the 360 Zhihui Cloud "Yunzhou" unified alert service, showing how it improves observability, reduces alert noise, and accelerates incident response for modern cloud‑native systems.

Alerting · Observability · Operations

0 likes · 14 min read

JD Cloud Developers

Feb 26, 2025 · Operations

How to Build Effective Business Monitoring Metrics for Reliable Operations

This guide explains the significance of business monitoring, differentiates technical and business metrics, outlines a step‑by‑step process for building a robust business indicator system, and shares practical methods, tools, and common pitfalls to ensure reliable, actionable monitoring in operations.

Operations · business monitoring · incident response

0 likes · 12 min read

Efficient Ops

Feb 20, 2025 · Information Security

How a Maintenance Staff Leak Exposed Security Gaps and How to Prevent It

A recent case where a maintenance worker exploited device‑management flaws to steal confidential files for foreign spies highlights the need for heightened vigilance, strict self‑discipline, and prompt reporting, offering practical steps to safeguard against similar security breaches.

data leakage · incident response · information security

0 likes · 4 min read

DataFunSummit

Feb 13, 2025 · Information Security

Building and Optimizing a Comprehensive Security System: Practices, Innovations, and Future Outlook

This article presents a detailed walkthrough of constructing a robust security architecture, covering single‑person security team strategies, risk perception and quantification, rapid incident response, automated detection, precise strike mechanisms, deterrence tactics, and forward‑looking plans for intelligent, data‑driven risk management.

Automation · Risk Management · fraud detection

0 likes · 21 min read

ITPUB

Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Monitoring · Observability

0 likes · 12 min read

Raymond Ops

Dec 26, 2024 · Information Security

How to Detect and Recover from a Linux Server Intrusion: A Step‑by‑Step Guide

This article details a real‑world Linux server breach, describing the symptoms, investigative commands, log analysis, malicious script removal, file attribute unlocking, and practical remediation steps, while highlighting key lessons and preventive measures for future security.

Intrusion Detection · Linux · Rootkit Removal

0 likes · 16 min read

Xiaohongshu Tech REDtech

Dec 25, 2024 · Industry Insights

How Xiaohongshu’s Security Team Achieved Zero Defense Losses in Shanghai’s 2024 “Panshi Action”

In December 2024, Xiaohongshu’s information security team topped the Shanghai “Panshi Action” competition, earning top blue‑team honors and a zero‑loss defense record by leveraging real‑time traffic monitoring, big‑data analytics, rapid incident response, and successful attacker attribution.

big data analysis · cybersecurity · incident response

0 likes · 3 min read

DevOps Operations Practice

Dec 8, 2024 · Information Security

Incident Report: Investigating and Removing a Server Malware Causing 100% CPU Usage

This article documents a step‑by‑step investigation of a compromised Linux server that exhibited 100% CPU usage, detailing process, network, and startup‑service analysis, the discovery of a cryptomining malware, and the complete removal procedure.

CPU · Linux · Malware

0 likes · 5 min read

Ops Development Stories

Nov 8, 2024 · Operations

Building a Simple Cloud‑Native Alert Platform: Features, Architecture & Roadmap

This article describes the design and implementation of a lightweight cloud‑native alert platform, outlining its core features, future enhancements, system architecture, and demo screenshots, offering practical insights for SREs and operations teams handling growing monitoring workloads.

Alert Management · Monitoring · Operations

0 likes · 6 min read

Java Architect Essentials

Oct 7, 2024 · Information Security

Insider Ransomware Attack by a Former Engineer: Case Study and Security Lessons

A disgruntled former infrastructure engineer at a U.S. industrial firm deleted backups, locked administrators, and demanded $750,000 in Bitcoin, leading to his arrest and highlighting the severe risks, legal consequences, and mitigation strategies associated with insider ransomware threats.

IT Governance · incident response · information security

0 likes · 10 min read

Software Development Quality

Sep 25, 2024 · Information Security

Understanding Security Vulnerability Grading: Levels, Upgrade Rules, and Common Types

This article explains a security vulnerability grading standard that defines five severity levels (S0‑S4), outlines handling timeframes, describes conditions for automatic level upgrades, and lists typical vulnerability types for each level to guide effective risk management.

Risk Management · Vulnerability · grading

0 likes · 8 min read

Huolala Tech

Sep 19, 2024 · Operations

How to Build a Team‑Wide Incident Response Platform for Seamless Online Ops

This article details XiaoBai's journey from struggling with ad‑hoc incident handling to designing a comprehensive platform that captures anomaly data, diagnoses root causes, and enables every team member to respond quickly and consistently, ultimately achieving an "everyone can respond" operation model.

Platform design · Root Cause Analysis · backend

0 likes · 14 min read

Open Source Linux

Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Linux · Monitoring · Operations

0 likes · 17 min read

Efficient Ops

Aug 20, 2024 · Information Security

Building a One‑Person PCI DSS Image‑Signing Service and Surviving a P0 Outage

This article recounts how a solo developer built a Django‑based Docker image signing service to meet PCI DSS requirements, faced two severe incidents—including a 17.5‑hour P0 outage caused by concurrency limits and a misconfigured Rekor service—and shares the operational lessons learned for reliable SRE practice.

Django · PCI-DSS · SRE

0 likes · 9 min read

21CTO

Jul 23, 2024 · Information Security

What the Microsoft Blue‑Screen Crisis Teaches About IT Risk Management

The massive Microsoft blue‑screen outage caused by a faulty CrowdStrike update highlights the dangers of single‑system reliance, poor code quality, insufficient QA, and the need for staged rollouts, robust backup, real‑time monitoring, and proactive incident‑response strategies for modern IT organizations.

Disaster Recovery · IT Operations · Monitoring

0 likes · 10 min read

JD Tech

Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

Monitoring · incident response · system stability

0 likes · 26 min read

JD Cloud Developers

Jul 5, 2024 · Information Security

How to Rapidly Respond to the Critical OpenSSH CVE‑2024‑6387 0‑Day Threat

This article examines the critical CVE‑2024‑6387 OpenSSH Server 0‑day vulnerability, explains its exploitation mechanics, and outlines effective emergency response strategies, including JD Cloud’s security operations solutions, to help enterprises swiftly mitigate risks, manage attack surfaces, and strengthen overall information security posture.

0day vulnerability · CVE-2024-6387 · JD Cloud

0 likes · 11 min read

DevOps Coach

Jun 27, 2024 · Operations

How to Run Effective Incident Response Drills for Resilient Systems

This article explains why regular disaster role‑playing, systematic testing, and focused responder preparation are essential for building robust incident response capabilities and reducing operational risk in production environments.

Operations · Resilience · SRE

0 likes · 7 min read

Efficient Ops

May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

SRE · capacity planning · incident response

0 likes · 29 min read

ITPUB

May 18, 2024 · Operations

How One Mistyped SQL Wiped All Orders—and the 45‑Minute Recovery That Followed

A quiet Saturday turned into a disaster when a simple UPDATE query accidentally deleted every order in production, prompting a rapid, step‑by‑step recovery, a post‑mortem analysis of the root causes, and a set of hard‑won operational lessons for any engineering team.

SQL · incident response · postmortem

0 likes · 8 min read

ITPUB

May 7, 2024 · Operations

How Tiny Mistakes Turned Into Massive Outages: Real‑World Ops Incident Stories

A collection of firsthand accounts reveals how seemingly harmless actions—changing system time, mistyping a script name, accidental deletions, and reckless debugging—triggered large‑scale service disruptions, forced emergency rollbacks, and costly penalties, highlighting the high stakes of operational negligence.

Outage · incident response · system-administration

0 likes · 10 min read

Wukong Talks Architecture

Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers and disrupted several dependent services for 87 minutes. This post‑mortem presents the root‑cause analysis and the improvement actions intended to strengthen system resilience and change management.

API · Cloud · Operations

0 likes · 8 min read

Open Source Linux

Apr 15, 2024 · Operations

What Caused the 87‑Minute Tencent Cloud API Outage and How It Was Fixed?

On April 8, 2024, a cloud API failure disrupted Tencent Cloud's console for 87 minutes, affecting 1,957 customers, prompting a detailed incident review that uncovered version‑compatibility and configuration‑data issues and led to a set of operational improvements.

Change Management · Operations · api failure

0 likes · 8 min read

dbaplus Community

Feb 25, 2024 · Databases

How a Simple UPDATE Wiped My Production Database—and the Lessons I Learned

After a weekend support ticket led to a reckless UPDATE that erased all orders in a production PostgreSQL database, the author details the rapid recovery steps, analyzes the human errors behind the disaster, draws lessons from Chernobyl, and outlines concrete post‑mortem improvements to prevent future data loss.

Databases · Recovery · SQL

0 likes · 7 min read

Zhuanzhuan Tech

Feb 21, 2024 · Operations

Network Operations Incident Report: BGP Routing Failure and Resolution

This report details a network operations incident in which a BGP routing change caused an EBGP neighbor to enter the Idle state, walks through the step‑by‑step troubleshooting and root‑cause analysis, and describes the implemented solution involving a new L3 node and redundant EBGP peers.

BGP · Cloud Networking · Network Operations

0 likes · 8 min read

Efficient Ops

Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

Monitoring · emergency planning · fault handling

0 likes · 14 min read

Huolala Tech

Jan 16, 2024 · Information Security

How Graph Databases Revolutionize Host Security Incident Response

This article explores how HuoLala's host security HIDS leverages Neo4j graph databases and the Neovis.js visualization library to unify process, network, and file data, enabling rapid attack‑chain reconstruction, efficient multi‑cloud incident response, and improved security operations.

Cypher · Host Security · Neo4j

0 likes · 16 min read

ITPUB

Dec 27, 2023 · Operations

When a Snapshot Share Became a Data Leak: Lessons from a Cloud Ops Failure

A developer mistakenly set a cloud disk snapshot to public, exposing a major client’s data, and recounts the frantic rollback, the ensuing panic among teammates, and the hard‑won operational lessons about high‑risk manual tasks, proper safeguards, and the need for visualized tooling.

Data Security · Operations · Risk Management

0 likes · 10 min read

dbaplus Community

Dec 10, 2023 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability Engineering

Drawing on twenty years of Google SRE experience, this article outlines eleven practical lessons—from scaling mitigation to disaster‑resilience testing—that help teams design, operate, and evolve reliable large‑scale services.

Disaster Recovery · SRE · canary releases

0 likes · 12 min read

Software Development Quality

Nov 28, 2023 · Information Security

D‑Eyes: Fast Incident‑Response Scanning for Ransomware, Malware & Host Configs

D‑Eyes is an open‑source detection and response tool from NSFOCUS that runs on Windows and Linux, offering command‑line utilities to scan files, processes, host information, network connections, and perform baseline and software‑supply‑chain checks, with built‑in YARA rules for ransomware, mining malware, botnets, and webshells.

Linux · Security Tool · Windows

0 likes · 9 min read

Architect

Nov 17, 2023 · Information Security

A Real-World Incident of Accidental Public Snapshot Sharing and Lessons Learned

The author recounts a 2018 incident where a cloud disk snapshot was unintentionally made public, exposing customer data, and shares a detailed reflection on the operational mistakes, risk management failures, and recommended safeguards for high‑risk cloud operations.

Cloud Computing · Data Security · incident response

0 likes · 9 min read

Data Thinking Notes

Nov 16, 2023 · Operations

How to Build Robust Data Fault Governance: A Three‑Phase Stability Blueprint

This article outlines a comprehensive data fault governance framework that classifies metrics, defines three development phases, establishes fault‑grading standards, clarifies responsibilities across development, data‑warehouse, and analytics teams, and implements pre‑, during‑, and post‑incident safeguards to dramatically reduce fault frequency and recovery time.

Automation · cross-team collaboration · data stability

0 likes · 15 min read

dbaplus Community

Nov 15, 2023 · Operations

How a Public Snapshot Leak Almost Cost a Client – Lessons from a Cloud Ops Failure

A cloud engineer mistakenly set a disk snapshot to public, exposing a major client’s data, rushed a rollback, and then reflected on the root causes, highlighting the need for strict review, visual tools, and risk‑aware practices in high‑risk operations.

Cloud Computing · Data Security · Operations

0 likes · 9 min read

JD Retail Technology

Nov 13, 2023 · Information Security

Red‑Blue Adversarial Testing for a Big Data Platform: Process, Benefits, and Best Practices

This article outlines the red‑blue adversarial testing process for a big‑data platform during the Double‑Eleven promotion, detailing its purpose, benefits, step‑by‑step execution, common issues, and recommendations to improve system reliability and security.

chaos engineering · incident response · information security

0 likes · 12 min read

Architecture and Beyond

Nov 12, 2023 · Frontend Development

Designing a Yellow Banner System for User Notification During Service Outages

The article explains how a configurable yellow banner system can be used on web interfaces to promptly inform users about service disruptions, guide their actions, increase transparency, improve experience, and outline implementation considerations such as configurability, persistence, and independent deployment.

Notification · System Design · frontend

0 likes · 6 min read

JD Tech

Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability

0 likes · 26 min read

Data Thinking Notes

Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform

0 likes · 27 min read

Su San Talks Tech

Oct 27, 2023 · Operations

What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review

This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.

Cloud Services · Disaster Recovery · Monitoring

0 likes · 12 min read

Bilibili Tech

Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

Alert Management · SRE · devops

0 likes · 19 min read

Didi Tech

Aug 31, 2023 · Big Data

Data Stability Construction and Fault Governance Practices at Didi Customer Service

Didi’s multi‑year data‑stability program for its customer‑service platform progressed through fault‑centered engineering, business‑aligned cross‑team work, and capability normalization, instituting pre‑, mid‑ and post‑fault safeguards, clear ownership, automated alerts and repair tools, which cut fault count by 42% and improved mean‑time‑to‑repair more than twofold while boosting team communication and satisfaction.

Automation · Data Reliability · Data Warehouse

0 likes · 16 min read

Efficient Ops

Aug 15, 2023 · Information Security

How I Recovered a Compromised Linux Server: Step‑by‑Step Incident Response

This article details a real‑world Linux server intrusion, describing the observed symptoms, the forensic investigation using commands like ps, top, last, and grep, the removal of malicious cron jobs and backdoors, and the lessons learned for securing SSH, file attributes, and cloud security groups.

Rootkit · chattr · cron

0 likes · 15 min read

dbaplus Community

Aug 1, 2023 · Information Security

How a Misconfigured Kubelet Led to Crypto Mining on Our Kubernetes Node – Lessons Learned

A self‑built Kubernetes cluster was compromised when an unprotected node with empty iptables and a kubelet that allowed anonymous API access was hijacked for Monero mining, prompting a detailed post‑mortem, root‑cause analysis, and hardening recommendations.

crypto mining · incident response · kubelet

0 likes · 5 min read

Continuous Delivery 2.0

Jul 31, 2023 · Information Security

15 Key Cybersecurity Metrics for Measuring and Improving Security Performance

The article outlines fifteen essential cybersecurity metrics—thirteen process indicators such as mean detection and response times, and two result indicators like data loss incidents and security ROI—to help organizations evaluate, monitor, and improve their security posture and inform investment decisions.

Risk Management · cybersecurity · incident response

0 likes · 4 min read

IT Services Circle

Jul 30, 2023 · Fundamentals

Why Blaming a Single Developer for a Crash Is Misguided: Lessons from the Xiaohongshu Incident

The recent Xiaohongshu app crash sparked public outcry and a viral screenshot blaming a developer, but the article explains that software bugs are inevitable, responsibility lies with the whole team, and proper debugging, testing, and process improvements are the rational response.

bug handling · incident response · software development

0 likes · 6 min read
