Tagged articles
222 articles
Black & White Path
May 15, 2026 · Information Security

Twin Brothers Delete 96 Government Databases – A Privileged‑Account Failure Case Study

In 2025, twin brothers with prior cyber‑crime convictions exploited a privileged‑account gap at a federal‑service contractor, erased 96 government databases within six minutes, used AI to seek log‑clearing methods, and triggered a multi‑layered forensic and legal response that highlights critical gaps in identity‑access management, backup integrity, and insider‑threat detection.

AI-assisted attack · MITRE ATT&CK · database deletion
0 likes · 13 min read
Black & White Path
May 11, 2026 · Information Security

State‑Sponsored Actors Gain Root on Palo Alto PAN‑OS via Captive Portal Buffer Overflow

A detailed analysis of CVE‑2026‑0300 reveals how a nation‑backed group exploited a buffer‑overflow in PAN‑OS's Captive Portal to obtain root on Palo Alto firewalls, outlining the attack chain, affected versions, immediate mitigations, long‑term remediation, compliance impacts, and lessons learned.

CVE-2026-0300 · Captive Portal · PAN-OS
0 likes · 12 min read
Ops Community
May 4, 2026 · Information Security

Investigating and Securing a Server After a Suspicious Login

When a production server shows unexpected high CPU usage and unknown login activity, this guide walks Linux ops engineers through confirming intrusion, stopping the attacker, tracing the attack path, removing backdoors, restoring system integrity, and applying hardening measures to prevent future breaches.

Forensics · Hardening · Linux
0 likes · 27 min read
FunTester
Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Operations · SRE · automation
0 likes · 8 min read
Raymond Ops
Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on‑call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step‑by‑step response flow, details war‑room roles, escalation paths, handoff checklists, post‑mortem procedures, and dozens of ready‑to‑use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations
0 likes · 27 min read
Black & White Path
Apr 17, 2026 · Information Security

Threat Alert: Cloud‑Native Cybercrime Group TeamPCP Targets Docker, Kubernetes, and Redis

TeamPCP, a newly identified cloud‑native threat group, has compromised at least 60,000 servers worldwide by exploiting exposed Docker APIs, Kubernetes clusters, Redis instances, and the React2Shell vulnerability, employing automated tools such as proxy.sh, kube.py, and react.py, with detailed MITRE ATT&CK mapping and concrete defense recommendations.

Docker · Kubernetes · MITRE ATT&CK
0 likes · 16 min read
dbaplus Community
Apr 14, 2026 · Information Security

How to Investigate and Respond to Kubernetes Cluster Intrusions

This guide walks through practical techniques for detecting, tracing, and remediating Kubernetes cluster compromises, covering pod‑level debugging, node inspection, audit‑log analysis, and common attacker behaviors such as privileged pod creation and hostPath mounting.

Cluster Forensics · Kubernetes · Pod Debugging
0 likes · 7 min read
Black & White Path
Apr 11, 2026 · Information Security

A Beginner’s Struggle: Securing a Compromised ThinkPHP Site Over Several Days

The author recounts a multi‑day incident response to a ThinkPHP website that was compromised via a weak admin password, detailing how repeated data tampering, hidden scheduled‑task scripts, and a ransom message were investigated, mitigated, and finally contained through systematic hardening and monitoring.

PHP · Server Hardening · ThinkPHP
0 likes · 7 min read
Alibaba Cloud Native
Apr 10, 2026 · Cloud Native

How HiClaw Automates Crash Alert Analysis with AI Agents in a Cloud‑Native Environment

This article details the design and workflow of HiClaw, an AI‑driven, cloud‑native system that intercepts DingTalk crash alerts, isolates analysis in secure containers, and automatically generates actionable reports, dramatically reducing manual investigation time while complying with strict internal security policies.

AI · automation · incident response
0 likes · 15 min read
Black & White Path
Apr 7, 2026 · Information Security

Ransomware ‘Shaming’ Attacks Surge: Over 2,000 Companies Exposed in 2026

Ransomware groups are increasingly using double‑extortion "shaming" tactics, publicly leaking stolen data to pressure victims, with Breachsense reporting more than 2,000 compromised firms in 2026, a 40% rise projected for the year, prompting new defensive strategies across industries.

cybersecurity · data breach · double extortion
0 likes · 10 min read
ITPUB
Mar 30, 2026 · Information Security

Essential Network Security FAQ: 100+ Key Concepts Explained

This comprehensive guide defines network security, outlines its core attributes, enumerates common threats and attack types, and provides practical mitigation strategies, covering everything from encryption basics and access controls to advanced topics like zero‑day vulnerabilities, zero‑trust architecture, and security automation.

Threats · access control · cybersecurity
0 likes · 44 min read
ITPUB
Mar 23, 2026 · Information Security

Essential Network Security Q&A: From Fundamentals to Advanced Threats

This comprehensive guide answers 100 common network security questions, covering basic concepts, core properties, threat sources, attack types, encryption methods, access controls, incident response, and emerging technologies such as zero‑trust, quantum encryption, and SOAR.

Threats · Vulnerability · access control
0 likes · 44 min read
Black & White Path
Mar 12, 2026 · Information Security

When 1 Billion IDs Leak: Inside the Biggest Identity Verification Breach Ever

A leading identity verification provider exposed over one billion personal records after a cloud storage bucket was misconfigured, revealing names, IDs, biometric data and more; the breach impacted finance, e‑commerce, government and social platforms, prompting analysis of technical and managerial failures and a set of remediation steps for individuals, enterprises and the industry.

KYC security · Zero Trust · cloud misconfiguration
0 likes · 10 min read
MaGe Linux Operations
Mar 4, 2026 · Information Security

Master Linux Intrusion Detection & Incident Response: A Practical Hands‑On Guide

This comprehensive guide walks you through building a layered Linux intrusion detection system, configuring host‑based tools such as AIDE, rkhunter, and auditd, automating security audits, performing forensic investigations, and executing a six‑step incident response workflow to detect, contain, and remediate attacks effectively.

AIDE · Auditd · Forensics
0 likes · 59 min read
Raymond Ops
Feb 25, 2026 · Operations

How to Stop 3 AM Alert Wake‑Ups: 5 Smart Monitoring Techniques

Every night engineers are jolted awake by noisy alerts, but by applying five practical techniques—including alert severity tiers, aggregation, dynamic thresholds, intelligent routing, and data‑driven effectiveness analysis—teams can cut daily alerts from over a hundred to fewer than ten and dramatically improve response times.

Alerting · Alertmanager · Prometheus
0 likes · 44 min read
Ops Community
Feb 12, 2026 · Operations

Why Did Our Nginx Hit Connection Limits? A Deep Dive into Misdiagnosis and Rate‑Limiting Redesign

This postmortem explains how an Nginx connection‑saturation incident was initially misidentified as a traffic surge, details the metrics and command‑line checks that revealed a connection‑lifecycle failure, and describes the step‑by‑step redesign of rate‑limiting, budgeting, monitoring, and run‑book procedures that restored stability.

NGINX · connection limits · incident response
0 likes · 32 min read
Xiao Liu Lab
Feb 12, 2026 · Information Security

When fail2ban Became a Monero Miner: Detection, Removal, and Prevention

A temporary test server on Tianyi Cloud was compromised by a malicious XMRig miner masquerading as fail2ban, causing CPU usage to skyrocket; the article details how the intrusion was discovered, the forensic steps taken, and a comprehensive remediation and hardening guide to prevent similar attacks.

CPU Spike · Fail2ban · Linux security
0 likes · 9 min read
Ray's Galactic Tech
Jan 15, 2026 · Operations

Ultimate Production Incident Response Handbook: Quick Commands, Root Cause Analysis, and Preventive Architecture

This comprehensive guide presents a unified framework for diagnosing and resolving production incidents—covering CPU spikes, OOM, disk exhaustion, log overload, port failures, container crashes, Kubernetes pod issues, SSH attacks, I/O bottlenecks, MySQL connection limits, Redis memory saturation, message‑queue backlogs, deployment failures, certificate expirations, file‑handle exhaustion, time drift, mining malware, and DDoS—by providing rapid‑check commands, immediate remediation steps, root‑cause classification, and architectural safeguards.

Kubernetes · Linux · Operations
0 likes · 11 min read
Raymond Ops
Jan 15, 2026 · Information Security

Master Linux Server Intrusion Detection & Response: A Complete Practical Guide

This guide walks Linux administrators through a full‑cycle intrusion detection and emergency response process, covering metric monitoring, log analysis, file integrity checks, attack confirmation, staged remediation, preventive hardening, and useful automation scripts to keep servers secure.

Linux · Security · Shell Scripts
0 likes · 16 min read
Raymond Ops
Dec 26, 2025 · Information Security

How to Respond When Your Server Is Compromised: Essential Incident Response and Forensics for Ops

This guide walks operations engineers through recognizing intrusion indicators, executing rapid detection scripts, following a structured 24‑hour response workflow, performing comprehensive digital forensics, and applying cleanup and hardening measures to secure compromised servers and prevent future attacks.

Server Security · System Hardening · digital forensics
0 likes · 15 min read
Ops Community
Dec 21, 2025 · Information Security

How to Investigate and Harden a Compromised Linux Server: Real-World Case Study

This guide walks through a real incident where a Linux server was hijacked by a mining virus, detailing step‑by‑step emergency response, systematic forensic investigation, cleanup procedures, and hardening measures to prevent future breaches, complete with scripts and best‑practice recommendations.

Linux · Rootkit · Server Hardening
0 likes · 26 min read
Efficient Ops
Dec 14, 2025 · Information Security

Detect and Respond to Linux Server Intrusions with Log Analysis

This guide walks you through using Linux log tools such as last, lastb, grep, and sshd_config to identify suspicious logins, trace malicious IPs, and apply immediate remediation steps for compromised servers, targeting ops engineers and developers.

Forensics · Linux · SSH
0 likes · 8 min read
MaGe Linux Operations
Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations
0 likes · 26 min read
Bilibili Tech
Nov 7, 2025 · Information Security

How AI-Driven Automation Transforms Security Alert Operations and Incident Tracing

This article explores the evolution of security alert automation from manual verification to SOAR and AI-driven solutions, detailing MCP-based AI agents, integration with various security tools, practical case studies of honey‑pot, HIDS, and EDR alert tracing, and the resulting efficiency gains and future outlook.

AI · Alert Analysis · MCP
0 likes · 16 min read
Liangxu Linux
Oct 26, 2025 · Information Security

Master Linux Server Intrusion Detection & Rapid Incident Response: A Complete Hands‑On Guide

This comprehensive guide walks Linux administrators through early detection of system anomalies, detailed log analysis, file‑integrity checks, intrusion confirmation, step‑by‑step emergency response, system hardening, preventive monitoring, and essential open‑source security tools, all illustrated with ready‑to‑run Bash scripts.

Linux · Security Scripts · incident response
0 likes · 17 min read
MaGe Linux Operations
Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche
0 likes · 20 min read
Open Source Linux
Oct 9, 2025 · Information Security

Essential Incident Response & Forensics Guide for Server Intrusions

This article provides a comprehensive step‑by‑step process for detecting server compromises, collecting system, memory, and network evidence, analyzing logs, isolating the affected host, removing malicious artifacts, and hardening the environment to prevent future attacks.

Forensics · Server Security · incident response
0 likes · 15 min read
Ops Community
Sep 24, 2025 · Operations

How Ops Engineers Can Stop Online Outages in Minutes: A Proven Emergency Playbook

This article outlines why a solid incident‑response plan is critical, describes typical failure scenarios, introduces the 3‑5‑10 rule for rapid diagnosis and mitigation, provides ready‑to‑run scripts for system checks, traffic throttling, service rollback, and showcases automation, AIOps and chaos‑engineering techniques to turn reactive firefighting into proactive resilience.

aiops · emergency plan · incident response
0 likes · 18 min read
Ops Community
Sep 18, 2025 · Information Security

Essential Linux Security: Common Vulnerabilities and Practical Defense Strategies

This guide walks you through the most critical Linux security flaws—from privilege‑escalation and misconfigured sudo to SSH, web server, kernel, and container risks—offering concrete hardening steps, logging practices, firewall rules, incident‑response procedures, and compliance tips to build a resilient production environment.

Container Security · Linux security · Log Monitoring
0 likes · 16 min read
dbaplus Community
Sep 3, 2025 · Operations

How to Build System Stability: Definitions, Challenges, and Practical Steps

This article explains what system stability means, why it matters, the difficulties of building it, and provides a detailed, step‑by‑step framework—including risk formulas, resource planning, monitoring, and emergency response—to help backend teams improve reliability and reduce business impact.

incident response · monitoring · risk management
0 likes · 23 min read
MaGe Linux Operations
Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

SRE · automation · incident response
0 likes · 19 min read
Liangxu Linux
Aug 9, 2025 · Information Security

How a Single Weak Password Sank a 158‑Year‑Old UK Logistics Firm

A 158‑year‑old British transport company was crippled by a ransomware attack after hackers guessed an employee's weak password, leading to full data encryption, massive financial loss, bankruptcy, and highlighting systemic IT security failures.

Akira group · Cyberattack · IT security
0 likes · 9 min read
Efficient Ops
Jul 8, 2025 · Information Security

How the SafePay Ransomware Crippled Ingram Micro’s Global Operations

On July 4, 2025, Ingram Micro, the world’s largest IT distributor, suffered a crippling ransomware attack by the SafePay group that stole nearly 1 TB of confidential data, encrypted critical systems, and forced a 48‑hour outage, highlighting severe risks for global supply‑chain operations.

Cyberattack · Ingram Micro · SafePay
0 likes · 3 min read
dbaplus Community
Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, driving team silence‑up, measuring progress, and the tools that make the process sustainable.

Alert Management · backend operations · error code design
0 likes · 42 min read
Cognitive Technology Team
Jun 17, 2025 · Cloud Computing

What a Single NullPointerException Taught Us About Cloud Reliability

The June 2025 Google Cloud outage, caused by an untested code change that triggered a NullPointerException, crippled over 70 core services worldwide, prompting a rapid technical fix, public apology, and industry‑wide reflections on cloud stability, fault tolerance, and deployment practices.

Google Cloud · cloud outage · incident response
0 likes · 7 min read
Zuoyebang Tech Team
Jun 12, 2025 · Information Security

How AI‑Powered RAG and Agents Are Revolutionizing Enterprise Security Operations

This article explains how the rise of AI large‑model technology and Retrieval‑Augmented Generation (RAG) combined with autonomous AI agents enable a three‑layer network‑boundary defense, address deep operational challenges such as alert overload and response latency, and dramatically improve incident‑response efficiency in large‑scale enterprises.

AI Agents · AI security · RAG
0 likes · 16 min read
Efficient Ops
Jun 9, 2025 · Operations

How OnCall Platforms Transform Incident Management and Reduce Manual Overhead

This article explains the purpose and key features of OnCall platforms, compares popular solutions like PagerDuty, Opsgenie, Grafana OnCall and Alibaba Cloud ARMS, clarifies webhooks with a simple analogy, and summarizes how centralized on‑call management boosts operational efficiency while minimizing manual intervention.

Oncall · incident response · webhook
0 likes · 5 min read
Efficient Ops
May 20, 2025 · Information Security

How an Overseas Hacker Group Disrupted a Guangzhou Tech Company's Services

A coordinated overseas cyber‑attack breached a Guangzhou tech firm's self‑service equipment backend, causing hours of service outage, data leakage, and significant losses, prompting swift police investigation, evidence preservation, and a detailed technical analysis of the attackers' methods.

China · cybersecurity · hacker group
0 likes · 4 min read
ITPUB
May 3, 2025 · Information Security

20 Critical Server Operations You Must Never Do – Real Cases & Fixes

Based on analysis of over 500 enterprise server failure cases, this guide lists 20 absolutely prohibited server actions across six dimensions, each illustrated with a real incident and practical technical measures to prevent recurrence.

DevOps · System Administration · incident response
0 likes · 14 min read
dbaplus Community
Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · SRE · incident response
0 likes · 20 min read
JD Cloud Developers
Feb 26, 2025 · Operations

How to Build Effective Business Monitoring Metrics for Reliable Operations

This guide explains the significance of business monitoring, differentiates technical and business metrics, outlines a step‑by‑step process for building a robust business indicator system, and shares practical methods, tools, and common pitfalls to ensure reliable, actionable monitoring in operations.

Operations · business monitoring · incident response
0 likes · 12 min read
Efficient Ops
Feb 20, 2025 · Information Security

How a Maintenance Staff Leak Exposed Security Gaps and How to Prevent It

A recent case where a maintenance worker exploited device‑management flaws to steal confidential files for foreign spies highlights the need for heightened vigilance, strict self‑discipline, and prompt reporting, offering practical steps to safeguard against similar security breaches.

data leakage · incident response · information security
0 likes · 4 min read
DataFunSummit
Feb 13, 2025 · Information Security

Building and Optimizing a Comprehensive Security System: Practices, Innovations, and Future Outlook

This article presents a detailed walkthrough of constructing a robust security architecture, covering single‑person security team strategies, risk perception and quantification, rapid incident response, automated detection, precise strike mechanisms, deterrence tactics, and forward‑looking plans for intelligent, data‑driven risk management.

Security · Security Architecture · automation
0 likes · 21 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Observability · SRE · data engineering
0 likes · 12 min read
Raymond Ops
Dec 26, 2024 · Information Security

How to Detect and Recover from a Linux Server Intrusion: A Step‑by‑Step Guide

This article details a real‑world Linux server breach, describing the symptoms, investigative commands, log analysis, malicious script removal, file attribute unlocking, and practical remediation steps, while highlighting key lessons and preventive measures for future security.

Linux · Rootkit Removal · Server Security
0 likes · 16 min read
Xiaohongshu Tech REDtech
Dec 25, 2024 · Industry Insights

How Xiaohongshu’s Security Team Achieved Zero Defense Losses in Shanghai’s 2024 “Panshi Action”

In December 2024, Xiaohongshu’s information security team topped the Shanghai “Panshi Action” competition, earning top blue‑team honors and a zero‑loss defense record by leveraging real‑time traffic monitoring, big‑data analytics, rapid incident response, and successful attacker attribution.

big data analysis · cybersecurity · incident response
0 likes · 3 min read
Java Architect Essentials
Oct 7, 2024 · Information Security

Insider Ransomware Attack by a Former Engineer: Case Study and Security Lessons

A disgruntled former infrastructure engineer at a U.S. industrial firm deleted backups, locked administrators, and demanded $750,000 in Bitcoin, leading to his arrest and highlighting the severe risks, legal consequences, and mitigation strategies associated with insider ransomware threats.

IT Governance · incident response · information security
0 likes · 10 min read
Huolala Tech
Sep 19, 2024 · Operations

How to Build a Team‑Wide Incident Response Platform for Seamless Online Ops

This article details XiaoBai's journey from struggling with ad‑hoc incident handling to designing a comprehensive platform that captures anomaly data, diagnoses root causes, and enables every team member to respond quickly and consistently, ultimately achieving an "everyone can respond" operation model.

Backend · Root Cause Analysis · incident response
0 likes · 14 min read
Open Source Linux
Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Backup · Linux · Operations
0 likes · 17 min read
Efficient Ops
Aug 20, 2024 · Information Security

Building a One‑Person PCI DSS Image‑Signing Service and Surviving a P0 Outage

This article recounts how a solo developer built a Django‑based Docker image signing service to meet PCI DSS requirements, faced two severe incidents—including a 17.5‑hour P0 outage caused by concurrency limits and a misconfigured Rekor service—and shares the operational lessons learned for reliable SRE practice.

Django · PCI DSS · SRE
0 likes · 9 min read
21CTO
Jul 23, 2024 · Information Security

What the Microsoft Blue‑Screen Crisis Teaches About IT Risk Management

The massive Microsoft blue‑screen outage caused by a faulty CrowdStrike update highlights the dangers of single‑system reliance, poor code quality, insufficient QA, and the need for staged rollouts, robust backup, real‑time monitoring, and proactive incident‑response strategies for modern IT organizations.

IT Operations · disaster recovery · incident response
0 likes · 10 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

incident response · monitoring · system stability
0 likes · 26 min read
JD Cloud Developers
Jul 5, 2024 · Information Security

How to Rapidly Respond to the Critical OpenSSH CVE‑2024‑6387 0‑Day Threat

This article examines the critical CVE‑2024‑6387 OpenSSH Server 0‑day vulnerability, explains its exploitation mechanics, and outlines effective emergency response strategies, including JD Cloud’s security operations solutions, to help enterprises swiftly mitigate risks, manage attack surfaces, and strengthen overall information security posture.

0day vulnerability · CVE-2024-6387 · JD Cloud
0 likes · 11 min read
Efficient Ops
May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

SRE · capacity planning · incident response
0 likes · 29 min read
ITPUB
May 18, 2024 · Operations

How One Mistyped SQL Wiped All Orders—and the 45‑Minute Recovery That Followed

A quiet Saturday turned into a disaster when a simple UPDATE query accidentally deleted every order in production, prompting a rapid, step‑by‑step recovery, a post‑mortem analysis of the root causes, and a set of hard‑won operational lessons for any engineering team.

SQL · incident response · postmortem
0 likes · 8 min read
ITPUB
May 7, 2024 · Operations

How Tiny Mistakes Turned Into Massive Outages: Real‑World Ops Incident Stories

A collection of firsthand accounts reveals how seemingly harmless actions—changing system time, mistyping a script name, accidental deletions, and reckless debugging—triggered large‑scale service disruptions, forced emergency rollbacks, and costly penalties, highlighting the high stakes of operational negligence.

Outage · System Administration · incident response
0 likes · 10 min read
dbaplus Community
Feb 25, 2024 · Databases

How a Simple UPDATE Wiped My Production Database—and the Lessons I Learned

After a weekend support ticket led to a reckless UPDATE that erased all orders in a production PostgreSQL database, the author details the rapid recovery steps, analyzes the human errors behind the disaster, draws lessons from Chernobyl, and outlines concrete post‑mortem improvements to prevent future data loss.

Recovery · SQL · databases
0 likes · 7 min read
Zhuanzhuan Tech
Feb 21, 2024 · Operations

Network Operations Incident Report: BGP Routing Failure and Resolution

This report details a network operations incident where a BGP routing change caused an EBGP neighbor to go idle, outlines the step‑by‑step troubleshooting, analysis of the root cause, and the implemented solution involving a new L3 node and redundant EBGP peers.

BGP · cloud networking · incident response
0 likes · 8 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

emergency planning · fault handling · incident response
0 likes · 14 min read
Huolala Tech
Jan 16, 2024 · Information Security

How Graph Databases Revolutionize Host Security Incident Response

This article explores how Huolala's host intrusion detection system (HIDS) leverages Neo4j graph databases and the Neovis.js visualization library to unify process, network, and file data, enabling rapid attack‑chain reconstruction, efficient multi‑cloud incident response, and improved security operations.

Cypher · Host Security · Neo4j
0 likes · 16 min read
ITPUB
Dec 27, 2023 · Operations

When a Snapshot Share Became a Data Leak: Lessons from a Cloud Ops Failure

A developer mistakenly set a cloud disk snapshot to public, exposing a major client’s data, and recounts the frantic rollback, the ensuing panic among teammates, and the hard‑won operational lessons about high‑risk manual tasks, proper safeguards, and the need for visualized tooling.

Operations · data security · incident response
0 likes · 10 min read
Software Development Quality
Nov 28, 2023 · Information Security

D‑Eyes: Fast Incident‑Response Scanning for Ransomware, Malware & Host Configs

D‑Eyes is an open‑source detection and response tool from NSFOCUS that runs on Windows and Linux, offering command‑line utilities to scan files, processes, host information, network connections, and perform baseline and software‑supply‑chain checks, with built‑in YARA rules for ransomware, mining malware, botnets, and webshells.

Linux · Windows · YARA
0 likes · 9 min read
Architect
Nov 17, 2023 · Information Security

A Real-World Incident of Accidental Public Snapshot Sharing and Lessons Learned

The author recounts a 2018 incident where a cloud disk snapshot was unintentionally made public, exposing customer data, and shares a detailed reflection on the operational mistakes, risk management failures, and recommended safeguards for high‑risk cloud operations.

cloud computing · data security · incident response
0 likes · 9 min read
Data Thinking Notes
Nov 16, 2023 · Operations

How to Build Robust Data Fault Governance: A Three‑Phase Stability Blueprint

This article outlines a comprehensive data fault governance framework that classifies metrics, defines three development phases, establishes fault‑grading standards, clarifies responsibilities across development, data‑warehouse, and analytics teams, and implements pre‑, during‑, and post‑incident safeguards to dramatically reduce fault frequency and recovery time.

Cross-Team Collaboration · automation · data stability
0 likes · 15 min read
Architecture and Beyond
Nov 12, 2023 · Frontend Development

Designing a Yellow Banner System for User Notification During Service Outages

The article explains how a configurable yellow banner system can be used on web interfaces to promptly inform users about service disruptions, guide their actions, increase transparency, improve experience, and outline implementation considerations such as configurability, persistence, and independent deployment.

Notification · System Design · User experience
0 likes · 6 min read
JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability
0 likes · 26 min read
Data Thinking Notes
Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform
0 likes · 27 min read
Su San Talks Tech
Oct 27, 2023 · Operations

What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review

This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.

Cloud Services · disaster recovery · incident response
0 likes · 12 min read
Bilibili Tech
Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

Alert Management · DevOps · SRE
0 likes · 19 min read
Didi Tech
Aug 31, 2023 · Big Data

Data Stability Construction and Fault Governance Practices at Didi Customer Service

Didi’s multi‑year data‑stability program for its customer‑service platform progressed through three phases: fault‑centered engineering, business‑aligned cross‑team work, and capability normalization. It instituted pre‑, mid‑ and post‑fault safeguards, clear ownership, automated alerts, and repair tools, which cut the fault count by 42% and reduced mean‑time‑to‑repair by more than half while improving team communication and satisfaction.

Data Reliability · Data Warehouse · ODS
0 likes · 16 min read
Efficient Ops
Aug 15, 2023 · Information Security

How I Recovered a Compromised Linux Server: Step‑by‑Step Incident Response

This article details a real‑world Linux server intrusion, describing the observed symptoms, the forensic investigation using commands like ps, top, last, and grep, the removal of malicious cron jobs and backdoors, and the lessons learned for securing SSH, file attributes, and cloud security groups.

Rootkit · SSH · Server Security
0 likes · 15 min read
Continuous Delivery 2.0
Jul 31, 2023 · Information Security

15 Key Cybersecurity Metrics for Measuring and Improving Security Performance

The article outlines fifteen essential cybersecurity metrics—thirteen process indicators such as mean detection and response times, and two result indicators like data loss incidents and security ROI—to help organizations evaluate, monitor, and improve their security posture and inform investment decisions.

cybersecurity · incident response · risk management
0 likes · 4 min read
Tencent Cloud Developer
Jun 8, 2023 · Operations

Stability Governance in Tencent Search: Architecture, Incident Management, and Automation

The article outlines Tencent Search’s stability governance, detailing a multi‑layered availability architecture, disaster‑recovery mechanisms, precise monitoring, rapid emergency workflows, pre‑release interception, extensive automation, and a collaborative governance model that together enhance system resilience, incident detection, and swift remediation.

availability architecture · incident response · monitoring
0 likes · 28 min read
HelloTech
Mar 30, 2023 · Operations

Emergency Response Planning and Practice at Hello (哈啰) for Large‑Scale Promotions

Hello’s technical‑risk team created a comprehensive emergency‑response system for large‑scale promotions—prioritizing core scenarios, running high‑frequency drills, modeling fault‑portraits, defining metric‑based triggers and clear rollback actions—which delivered zero incidents during the 930 Big Sale, achieved over 80 % core‑line coverage, and now aims to automate plan selection and execution.

Case Study · emergency planning · incident response
0 likes · 16 min read
Java Captain
Mar 7, 2023 · Information Security

Server Intrusion Investigation and Remediation Steps

This article details a recent server intrusion case, describing the observed symptoms, possible causes, step‑by‑step forensic investigation using commands like ps, top, grep and crontab, and comprehensive remediation actions such as tightening SSH security, unlocking and restoring system binaries, removing malicious scripts, and key lessons for future protection.

SSH Hardening · Server Security · chattr
0 likes · 14 min read
Alibaba Cloud Developer
Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · capacity planning · incident response
0 likes · 25 min read