Tagged articles

incident response

230 articles · Page 2 of 3

Jun 8, 2023 · Operations

Stability Governance in Tencent Search: Architecture, Incident Management, and Automation

The article outlines Tencent Search’s stability governance, detailing a multi‑layered availability architecture, disaster‑recovery mechanisms, precise monitoring, rapid emergency workflows, pre‑release interception, extensive automation, and a collaborative governance model that together enhance system resilience, incident detection, and swift remediation.

Monitoring, availability architecture, incident response

0 likes · 28 min read

Ziru Technology

May 26, 2023 · Databases

How We Resolved a Sudden DB Load Spike: Root Cause, Fixes, and SQL Optimization Lessons

This article details a November 2022 database outage caused by a sudden CPU and load surge, explains how the team diagnosed the issue, outlines the emergency steps taken, and shares practical SQL performance optimization recommendations to prevent similar incidents.

Oracle, Performance Tuning, database

0 likes · 14 min read

Huolala Tech

Apr 7, 2023 · Operations

How Huolala Built a Scalable Tech Stability System – Key Lessons for Reliability

This article details Huolala's journey in establishing a comprehensive technical stability framework, covering organizational challenges, risk governance, incident response, cultural initiatives, and future automation to enhance system reliability at scale.

Operations, Risk Governance, SRE

0 likes · 16 min read

HelloTech

Mar 30, 2023 · Operations

Emergency Response Planning and Practice at Hello (哈啰) for Large‑Scale Promotions

Hello’s technical‑risk team created a comprehensive emergency‑response system for large‑scale promotions—prioritizing core scenarios, running high‑frequency drills, modeling fault‑portraits, defining metric‑based triggers and clear rollback actions—which delivered zero incidents during the 930 Big Sale, achieved over 80 % core‑line coverage, and now aims to automate plan selection and execution.

Case Study, emergency planning, incident response

0 likes · 16 min read

DevOps Cloud Academy

Mar 21, 2023 · Cloud Native

Robusta: An Open‑Source Python Platform for Kubernetes Troubleshooting and Automated Incident Response

Robusta is a Python‑based open‑source platform that layers on top of monitoring stacks like Prometheus to automatically detect, diagnose, and remediate Kubernetes alerts through built‑in automations, optional web UI, and Helm‑based installation for cloud‑native environments.

Automation, Cloud Native, Observability

0 likes · 7 min read

Java Captain

Mar 7, 2023 · Information Security

Server Intrusion Investigation and Remediation Steps

This article details a recent server intrusion case: the observed symptoms, possible causes, a step‑by‑step forensic investigation using commands like ps, top, grep and crontab, remediation actions such as tightening SSH security, unlocking and restoring system binaries, and removing malicious scripts, plus key lessons for future protection.

SSH Hardening, chattr, incident response

0 likes · 14 min read

Alibaba Cloud Developer

Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Monitoring, Operations, Stability

0 likes · 25 min read

MaGe Linux Operations

Dec 24, 2022 · Operations

How to Build an Effective Incident Response Team: Roles, Priorities, and Tools

This guide explains essential incident response roles, how to quickly identify the source of a problem, prioritize actions, use efficient communication tools, and address human factors to improve your team's emergency response capabilities.

communication tools, human factors, incident response

0 likes · 12 min read

Huolala Safety Emergency Response Center

Dec 2, 2022 · Information Security

How to Detect, Contain, and Eradicate the DarkKomet RAT: A Full Incident Response Walkthrough

This article provides a step‑by‑step technical analysis of the DarkKomet remote‑access trojan, covering its capabilities, infection vectors, detection methods using TTP‑driven EDR, containment actions, eradication procedures, root‑cause forensics, and post‑incident recovery measures.

DarkKomet, EDR, Forensics

0 likes · 9 min read

dbaplus Community

Nov 29, 2022 · Backend Development

How a Mistaken Delete in ElasticSearch Nearly Erased 17 Million Products – Key Lessons

A senior engineer accidentally issued a DELETE request on an ElasticSearch index holding 17 million product records, triggering a massive data‑loss incident. The article details the team’s recovery strategies, scaling challenges, and process improvements as guidance for backend developers.

Microservices, data indexing, incident response

0 likes · 14 min read

ITPUB

Nov 27, 2022 · Operations

How to Build an Automated Fault‑Self‑Healing System for Large‑Scale Operations

This article explains why nightly disk‑space alerts demand automated fault‑self‑healing, outlines the necessary process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform with CMDB integration, and provides practical options for script execution and result notification.

CMDB, Operations Automation, devops

0 likes · 9 min read

Architects Research Society

Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture, Operations, Reliability

0 likes · 18 min read

Efficient Ops

Sep 13, 2022 · Information Security

How to Detect and Recover from a Linux Server Intrusion: A Step‑by‑Step Guide

This article details a real‑world Linux server compromise, describing the symptoms, possible causes, investigative commands, hidden malicious scripts, file attribute locks, and practical remediation steps to restore the system and improve future security.

Intrusion Detection, Rootkit, chattr

0 likes · 15 min read

JavaEdge

Sep 5, 2022 · Operations

Engineering Double‑11‑Scale E‑Commerce Events: A Complete Technical Playbook

From kickoff meetings and traffic forecasting to load‑testing strategies, rate‑limiting designs, emergency runbooks, and post‑event retrospectives, this guide walks engineers through the complete technical workflow required to ensure a Double‑11‑scale e‑commerce promotion runs smoothly and safely.

Monitoring, Traffic Engineering, incident response

0 likes · 12 min read

Liangxu Linux

Aug 21, 2022 · Information Security

Master Linux Incident Response: Detect, Remove, and Harden Malware Infections

This guide walks you through a complete Linux incident‑response workflow—identifying suspicious behavior, locating and terminating malicious processes, eliminating virus files, closing persistence mechanisms, and hardening the system to prevent future compromises—using practical shell commands and real‑world examples.

Linux, Malware Removal, System Hardening

0 likes · 9 min read

Su San Talks Tech

Jul 13, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

This post‑mortem details the July 2021 Bilibili outage caused by a Lua bug in the OpenResty‑based SLB, describing the timeline, root‑cause analysis, mitigation steps, and the technical and organizational improvements implemented to prevent similar incidents.

Load Balancer, Lua, SRE

0 likes · 18 min read

High Availability Architecture

Jul 12, 2022 · Operations

Postmortem of the July 13, 2021 Bilibili SLB Outage: Timeline, Root Cause, and Improvement Measures

This article details the July 13, 2021 Bilibili service outage caused by a Lua‑based SLB CPU spike, describing the incident timeline, root‑cause analysis of a weight‑zero bug, mitigation steps including new SLB deployment, and the subsequent operational and architectural improvements.

High Availability, Load Balancer, Lua

0 likes · 17 min read

Bilibili Tech

Jul 12, 2022 · Operations

Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili’s L7 SLB crashed when a recent Lua deployment set a balancer weight to the string “0”, producing a NaN value that triggered an infinite loop and 100 % CPU usage. The team responded with emergency restarts and a fresh cluster rollout, followed by long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.

High Availability, Load Balancer, Root Cause Analysis

0 likes · 17 min read

Dada Group Technology

Jun 20, 2022 · Information Security

Design and Implementation of JD Daojia Security Operations Center (SOC) Platform

This article details the challenges, design choices, deployment steps, detection model creation, data processing, visualization, and future plans of JD Daojia's security operations platform, highlighting the use of Graylog, Elasticsearch, and MongoDB to achieve scalable, real‑time threat detection and response.

Data Visualization, Graylog, Security Operations

0 likes · 16 min read

Continuous Delivery 2.0

Jun 17, 2022 · Operations

Addressing SRE Overload: Causes and Mitigation Strategies

The article examines why SRE teams experience overload due to high incident response demands, analyzes contributing factors such as production issues, alert volume, and manual processes, and proposes comprehensive mitigation steps including better testing, load management, and proactive error detection to reduce on‑call burden.

Production, SRE, incident response

0 likes · 5 min read

MaGe Linux Operations

Jun 5, 2022 · Information Security

How a SpringBoot Server Was Hijacked for Crypto Mining – Investigation & Fixes

This article chronicles the discovery of a server breach used for cryptocurrency mining, detailing the malicious code, forensic analysis of the trojan's actions and artifacts, and the step‑by‑step remediation measures taken to secure the system.

Malware Analysis, crypto mining, incident response

0 likes · 11 min read

Open Source Linux

Jun 1, 2022 · Information Security

How a SpringBoot Server Was Hijacked for Crypto Mining and What You Can Do

This article chronicles the discovery of a server breach used for cryptocurrency mining, analyzes the malicious Python payload and its system modifications, and provides concrete remediation steps such as system reinstall, non‑root deployment, firewall hardening, and Nginx authentication.

Cryptocurrency Mining, Malware Analysis, incident response

0 likes · 12 min read

Architecture and Beyond

May 21, 2022 · Product Management

Mastering Product Prioritization: From Requirement Levels to Incident Management

This article explains how limited resources shape product requirement prioritization, test‑bug grading, product‑module classification, online bug severity, and incident response levels, offering practical frameworks and concrete grading tables to help teams make objective, value‑driven decisions throughout a product’s lifecycle.

Operations, bug triage, incident response

0 likes · 13 min read

DeWu Technology

May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three‑tier NOC‑SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS‑linked workflow, boosting detection frequency, cutting critical response times to under five minutes, and improving overall system reliability while emphasizing ongoing baseline and rule maintenance.

Alert Management, NOC, Operations

0 likes · 13 min read

Bilibili Tech

Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability, Metrics, Oncall

0 likes · 23 min read

DaTaobao Tech

Apr 20, 2022 · Operations

Understanding Wireless Operations and Maintenance: Origins, Challenges, and Future Directions

Wireless operations and maintenance (O&M) evolved from backend‑focused practices to address stability and performance of mobile‑device services, tackling low issue detection rates and delayed responses through improved monitoring, gray‑release tagging, phased rollouts, AI‑driven diagnostics, and automated release gates, while inviting collaborative development.

Monitoring, gray-release, incident response

0 likes · 13 min read

21CTO

Mar 31, 2022 · Operations

What Caused the Biggest 2021 Outages? Lessons from Bilibili, Facebook, AWS, and More

The article reviews ten major 2021 service outages—from Chinese platforms like Bilibili and Futu to global giants such as Facebook, Roblox, and AWS—analyzing their root causes, redundancy failures, and the operational lessons needed to prevent future black‑swans.

High Availability, incident response, outage analysis

0 likes · 15 min read

Open Source Linux

Mar 31, 2022 · Information Security

Misconfigured Kubelet Triggered Crypto‑Mining Breach – Secure Your Cluster Now

A Kubernetes node was compromised for Monero mining due to empty iptables, an exposed kubelet API, and a mis‑commented configuration, prompting a detailed post‑mortem and practical hardening steps to prevent similar attacks.

crypto mining, firewall, incident response

0 likes · 5 min read

Beike Product & Technology

Feb 18, 2022 · Operations

KeMonitor Alert Platform: Systematic Alert Governance and Practices

The article presents a comprehensive case study of KeMonitor, a one‑stop monitoring and alert platform built by Beike (贝壳找房) to unify fragmented alerts, define lifecycle‑based governance, standardize alert metadata, and implement graded subscription, on‑call escalation, silencing, self‑healing, and post‑mortem analysis, thereby improving incident response efficiency and reducing alert fatigue.

Alerting, SOP, incident response

0 likes · 17 min read

MaGe Linux Operations

Feb 16, 2022 · Information Security

How a Misconfigured Kubelet Triggered a Crypto‑Mining Breach—and How to Stop It

A Kubernetes cluster was compromised when a misconfigured kubelet allowed anonymous API access, enabling attackers to run a Monero miner; the post details the investigation, root causes, and practical hardening steps to prevent similar intrusions.

crypto mining, firewall, incident response

0 likes · 5 min read

Programmer DD

Feb 8, 2022 · Operations

What Triggered the Biggest Internet Outages of 2021? Lessons from 10 Major Incidents

A comprehensive review of ten major 2021 internet outages—from domestic platforms like Bilibili and Futu to global services such as Facebook, Roblox, and AWS—examines their root causes, the role of infrastructure design, and the operational lessons needed to improve system resilience.

cloud infrastructure, incident response, outage analysis

0 likes · 16 min read

IT Services Circle

Feb 2, 2022 · Operations

Huawei Cloud’s New Year Defense: How SRE Teams Counter Massive Attacks

Huawei Cloud’s internal “blue‑team” launched over twenty coordinated attacks around Chinese New Year, but the company’s SRE “red‑team” and a dedicated 24/7 “special forces” unit detected, isolated, and resolved incidents within minutes, keeping failure rates below 0.01% and demonstrating advanced cloud operations and security practices.

SRE, incident response

0 likes · 9 min read

Efficient Ops

Nov 26, 2021 · Information Security

How a Misconfigured Kubelet Led to a Crypto‑Mining Breach and What to Do

A self‑built Kubernetes cluster suffered a crypto‑mining intrusion due to empty iptables and a misconfigured kubelet, prompting a detailed post‑mortem that outlines the symptoms, root‑cause analysis, and practical hardening steps to protect similar environments.

crypto mining, firewall, incident response

0 likes · 5 min read

Open Source Linux

Nov 25, 2021 · Information Security

Master Linux Incident Response: Step-by-Step Virus Detection and Removal

This guide walks through a four‑stage Linux incident‑response workflow—identifying symptoms, killing malicious processes, closing persistence mechanisms, and hardening the system—while providing the exact shell commands needed to detect and eradicate Linux malware.

Linux, Malware Removal, Shell Commands

0 likes · 6 min read

Open Source Linux

Nov 8, 2021 · Information Security

Essential Linux Incident Response Commands for Quick Security Investigations

This guide outlines the typical Linux and Windows environments encountered in security incidents, common threats such as mining and ransomware, and provides a step‑by‑step workflow with essential commands for process, user, network, and file investigation to identify and remediate compromises.

File Analysis, Linux, incident response

0 likes · 8 min read

Java High-Performance Architecture

Oct 20, 2021 · Information Security

How a Misconfigured Kubelet Led to Crypto Mining on Our Kubernetes Node – Lessons Learned

After discovering a suspicious process on one of our self‑built Kubernetes nodes, we traced the intrusion to a misconfigured kubelet that exposed the API, allowing attackers to run a Monero mining script, and we outline the investigation steps and hardening measures to prevent similar breaches.

crypto mining, incident response, kubelet

0 likes · 6 min read

Java Architect Essentials

Aug 22, 2021 · Information Security

Former Hospital Network Administrator Carries Out Revenge Attack, Crippling Xi'an Hospital's Diagnostic Systems

In Xi'an's Lianhu District, a disgruntled former network administrator exploited self‑taught networking skills to illegally infiltrate a hospital's internal servers, remotely executing destructive operations that deleted critical files, disabled printers, CT and ultrasound machines, and ultimately caused the entire medical information system to collapse, prompting a police investigation that led to his arrest and criminal detention for damaging computer information systems.

cybersecurity, hospital, incident response

0 likes · 5 min read

DevOps

Jun 28, 2021 · Databases

When Deleting Databases Goes Wrong: Cases, Legal Risks, and DevOps Lessons

This article examines real-world database deletion incidents, the associated legal consequences, and how DevOps culture and operational best‑practices can turn such mistakes into learning opportunities rather than career‑ending failures.

Operations, database deletion, devops

0 likes · 13 min read

DevOps

Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection, Operations, Reliability

0 likes · 11 min read

Efficient Ops

Jun 1, 2021 · Operations

Mastering System Stability: Building a Chaos‑Driven Platform for Financial Ops

This article details how a major securities firm analyzed business stability, built a comprehensive stability engineering platform using chaos engineering, practiced extensive fault‑injection drills, and outlines future directions such as random‑scenario exercises, red‑blue battles, and AI‑driven risk detection.

Operations, Platform, chaos engineering

0 likes · 11 min read

Alibaba Cloud Developer

May 20, 2021 · Operations

Mastering Production Incident Response: Structured Problem Solving and Key Roles

This guide explains how to design and practice a structured incident‑response process—defining problems, applying quick‑recovery steps, analyzing root causes, standardizing solutions, and assigning critical roles—to dramatically reduce production outage duration.

Operations, SRE, fault handling

0 likes · 11 min read

Alibaba Cloud Developer

May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations, SRE, fault-recovery

0 likes · 10 min read

Liangxu Linux

Apr 21, 2021 · Information Security

Essential Linux Incident‑Response Commands for Quick Threat Detection

This guide walks through common Linux emergency scenarios—such as mining malware, ransomware, and backdoors—detailing a step‑by‑step workflow and providing essential command‑line tools for process, user, network, and file investigation on CentOS 6 and Windows Server 2008 systems.

Forensics, Linux, incident response

0 likes · 11 min read

dbaplus Community

Apr 19, 2021 · Information Security

How to Diagnose and Remove a Linux Backdoor That Triggers Massive Outbound Traffic

When a server suddenly generated 800 MB of outbound traffic and SSH became unresponsive, the author traced the issue to a hidden backdoor, blocked the malicious IP, identified compromised binaries, removed malicious processes and startup scripts, and outlined preventive security measures.

Linux, backdoor, incident response

0 likes · 9 min read

Alibaba Cloud Developer

Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

Monitoring, SRE, capacity planning

0 likes · 21 min read

Java Captain

Mar 7, 2021 · Operations

Recovering Accidentally Deleted Production Server Data with ext3grep, extundelete, and MySQL Binlog

This article recounts a production server data loss caused by an erroneous rm -rf command, describes how ext3grep and extundelete were used to attempt file restoration, and explains how MySQL binlog replay finally recovered critical data while sharing lessons learned for future incident handling.

Data Recovery, Linux, backup

0 likes · 7 min read

Liangxu Linux

Feb 25, 2021 · Information Security

How to Automate Linux Incident Response and Analyze a Mining Malware

This article shares a step‑by‑step Linux incident‑response workflow, including an automated Bash information‑gathering script, analysis of malicious cron jobs and a 439‑line mining malware, its SSH‑based lateral spread, and practical cleanup procedures with a reusable toolbox on GitHub.

Bash automation, Cron Jobs, Cryptocurrency Mining

0 likes · 13 min read

Efficient Ops

Jan 13, 2021 · Information Security

How to Detect and Eradicate a Hidden Linux Mining Botnet: A Step‑by‑Step Analysis

This article walks through a real‑world Linux mining malware infection, detailing how the attacker hid a malicious cron job, used LD_PRELOAD rootkits, propagated via SSH keys, and how the analyst uncovered and removed the threat using busybox, strace, and careful forensic commands.

Cryptocurrency Mining, Malware Analysis, incident response

0 likes · 12 min read

iQIYI Technical Product Team

Jan 8, 2021 · Information Security

SOAR (Security Orchestration, Automation and Response) Implementation at iQIYI: Architecture, Scenarios, and Roadmap

iQIYI’s SOAR platform, built on StackStorm and the Walkoff visual editor, integrates security components, scripts, chat‑ops bots, and a mini‑program to automate detection and response, cutting MTTR by roughly 75% across high‑frequency routine tasks and low‑frequency critical incidents while planning broader coverage and knowledge‑base expansion.

SOAR, Security Operations, StackStorm

0 likes · 8 min read

Yanxuan Tech Team

Dec 14, 2020 · Operations

Mastering Stability Governance: Practical Strategies for Reliable Supply‑Chain Systems

This article examines the critical role of stability governance in evolving systems, outlines a three‑stage framework—usability, monitoring alerts, and online emergency—illustrated with a case study of an electronic waybill service, and shares concrete strategies for prevention, detection, response, and post‑mortem to achieve predictable, observable, and fast‑acting reliability.

Governance, Monitoring, Operations

0 likes · 11 min read

NetEase Yanxuan Technology Product Team

Dec 11, 2020 · Operations

How to Build Effective Stability Governance for E‑commerce Logistics Services

This article analyzes the concept of stability governance, outlines its five fault‑management sub‑domains (prevention, perception, reach, mitigation, and post‑mortem), examines the pain points of an electronic waybill service, and presents a comprehensive strategy backed by concrete implementation steps in three areas: availability, monitoring, and online emergency handling.

Monitoring, Operations, incident response

0 likes · 12 min read

21CTO

Dec 10, 2020 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Incident Response

This article explains how Netflix built the Telltale monitoring system to consolidate data sources, provide multidimensional health assessments, deliver intelligent alerts, and streamline incident management for over 100 production applications, reducing on‑call fatigue and improving service reliability.

Monitoring, Netflix, Observability

0 likes · 14 min read

macrozheng

Nov 26, 2020 · Information Security

Recovering a Server Hijacked by a Crypto‑Mining Virus: My Step‑by‑Step Fix

After my small 1‑CPU, 2 GB server was compromised by a crypto‑mining virus that hijacked SSH access, I used VNC to investigate, identified malicious processes, traced infected files, removed cron jobs, restored system utilities, repaired SELinux, and closed the Redis vulnerability to fully recover the machine.

Linux, Redis vulnerability, crypto mining

0 likes · 10 min read

Taobao Frontend Technology

Nov 23, 2020 · Operations

Achieving 1‑5‑10 Front‑End Monitoring with JSTracker for Double‑11

This article explains how the JSTracker platform was used to build a comprehensive end‑to‑end front‑end monitoring and data analysis solution that meets the 1‑5‑10 safety production goal—detecting issues within one minute, locating them in five, and fixing them in ten—by improving coverage, subscription, metrics, and gray‑release monitoring for Alibaba’s Double‑11 promotion.

Monitoring, Operations, gray-release

0 likes · 15 min read

JD Cloud Developers

Nov 13, 2020 · Information Security

How JD Cloud Secures Massive E‑Commerce Events: A Multi‑Layered Defense Blueprint

This article explains how JD Cloud builds a multi‑layered, end‑to‑end security architecture—including five defense layers, four implementation stages, three guiding principles, and two key focuses—to protect high‑traffic e‑commerce events such as 618 and 11.11 from attacks and ensure stable, safe operations.

Risk Management · cloud security · e-commerce

0 likes · 15 min read

Efficient Ops

Oct 27, 2020 · Information Security

How to Detect Account Security Threats Using Log Analysis and Alerts

This article explains practical methods for detecting account security threats—such as blacklisted, expired, or abnormal login behaviors—by analyzing Linux and Windows login logs, defining detection rules, and leveraging automated tools to generate timely alerts and reduce security risks.

Threat Detection · account security · incident response

0 likes · 27 min read

Java Backend Technology

Oct 22, 2020 · Information Security

What Caused the Massive P1 Outage? A Real‑World Security Scanning Bug Uncovered

A sudden P1 incident reset all user passwords, and after a thorough investigation the team discovered that a security‑scanning tool’s weak‑password check repeatedly hit login attempts, triggering a bug that caused the outage, highlighting the critical need for proper incident response and security engineering.

Operations · P1 incident · database

0 likes · 7 min read

Liangxu Linux

Oct 6, 2020 · Information Security

How I Uncovered a Phishing Mooncake Email Using Wireshark, Shodan, and OSINT

During the Mid‑Autumn Festival I received a seemingly harmless mooncake email, suspected it was a phishing test, and then used a virtual machine, network‑capture tools, Shodan, and open‑source intelligence to trace the malicious link back to its source and exposed the underlying infrastructure.

Network reconnaissance · OSINT · Phishing

0 likes · 4 min read

Top Architect

Sep 24, 2020 · Information Security

Case Study: Six-Year Prison Sentence for a Programmer Who Deleted SaaS Data and Its Implications for Data Security

A Shanghai court sentenced a programmer surnamed He to six years in prison for deliberately deleting all SaaS data of Weimeng, causing an eight‑day outage, over 300 million yuan in losses, and prompting a discussion on on‑premise versus cloud data protection strategies.

Cloud Computing · SaaS incident · data deletion

0 likes · 9 min read

Efficient Ops

Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Observability · Operations · SRE

0 likes · 17 min read

DevOps

Aug 19, 2020 · Operations

DevOps Lessons from the Knight Capital Group Collapse: A Case Study

The article analyzes the 2012 Knight Capital Group disaster, showing how a manual deployment error, lingering legacy code, a missing kill‑switch, and inadequate monitoring caused a loss of roughly $460 million within 45 minutes, and extracts key DevOps best‑practice lessons to prevent similar failures.

Risk Management · deployment · devops

0 likes · 13 min read

Qunar Tech Salon

Jul 27, 2020 · Operations

Website Operations at Qunar: Ensuring Stability, Security, and Efficiency During the Pandemic

The interview with Sun Bin, head of Qunar's website operations, explains how the team acted as a specialized "BlackOps" unit to provide robust technical guarantees, automate problem‑solving, protect user data, optimize resources, and maintain continuous service during the COVID‑19 outbreak.

Data Security · Pandemic Response · cloud infrastructure

0 likes · 10 min read

Aikesheng Open Source Community

Jul 21, 2020 · Databases

Auditing MySQL Operations with init_connect and Binlog Analysis

This article demonstrates how to audit MySQL user actions by configuring init_connect, creating an audit log table, enabling binlog, and analyzing binlog entries to identify the user and IP responsible for accidental table deletions.

Auditing · Database operations · incident response

0 likes · 8 min read

Efficient Ops

Jul 16, 2020 · Information Security

When Revenge Triggers IT Disasters: Lessons on Employee Dissatisfaction and Security

The article examines how personal grievances have led to tragic events—from a bus driver’s fatal crash to destructive IT incidents—highlighting the need for psychological care, robust security policies, and a DevSecOps mindset to prevent such revenge‑driven catastrophes.

DevSecOps · IT Governance · employee wellbeing

0 likes · 5 min read

Tencent Cloud Developer

May 14, 2020 · Operations

Tencent Classroom Monitoring Practices: Challenges, Strategies, and Future Directions

During the pandemic’s “classes suspended, learning continues” surge, Tencent Classroom tackled a 120‑fold traffic jump by rapidly deploying Grafana dashboards, Kibana logs, internal Moniter and cloud monitoring tools, establishing a three‑layer feedback‑alert‑on‑call model, and now plans automation, unified visualizations, and chaos‑engineering to further boost observability and service reliability.

Cloud Monitoring · SRE · Tencent Classroom

0 likes · 14 min read

dbaplus Community

Apr 20, 2020 · Operations

Preventing Database Disasters: Key Lessons from the Zhengda Hospital Outage

The Zhengda Hospital HIS database outage, caused by unauthorized scripts and poor permission controls, sparked a detailed discussion on how to prevent reckless production testing, enforce proper authorization, design efficient yet secure workflows, improve outsourcing oversight, and build robust emergency and compliance practices.

Database operations · Permission Management · Production Security

0 likes · 12 min read

MaGe Linux Operations

Mar 30, 2020 · Information Security

How a Linux Server Became a Botnet: A Step‑by‑Step Rootkit Forensics Walkthrough

This article details a real‑world Linux rootkit intrusion, describing the symptoms, forensic analysis techniques, evidence uncovered, the underlying Awstats vulnerability, and a comprehensive remediation plan to secure the server and prevent future compromises.

Awstats · Forensics · Rootkit

0 likes · 13 min read

Java Backend Technology

Mar 5, 2020 · Operations

How a Massive Delete-Database Crisis at Weimeng Reveals Key Ops Lessons

On Feb 23, Weimeng suffered a large‑scale system outage caused by a core operations staff member deliberately deleting production databases, prompting a multi‑day recovery effort with Tencent Cloud support; the article examines the incident’s background, historical parallels, crisis response, and broader operational insights for DevOps and reliability engineering.

Database Recovery · Operations · crisis management

0 likes · 16 min read

ITPUB

Feb 29, 2020 · Information Security

What the Weimeng Database Deletion Reveals About Backup and Permission Strategies

The article analyzes the recent Weimeng data‑loss incident, explains why recovery took 36 hours, highlights insider abuse, and offers a practical guide for small and large teams covering reliable backups, minimal‑privilege management, and cloud‑based disaster‑recovery solutions.

Database Security · Privilege Management · backup strategy

0 likes · 9 min read

Efficient Ops

Feb 26, 2020 · Operations

What the Weimeng Delete‑Database Outage Teaches About Modern Ops

After a core operations staff member deliberately deleted Weimeng’s production database in February, the platform endured a multi‑day outage, prompting a transparent crisis response, extensive Tencent Cloud support, and a deep analysis of recovery challenges, operational best practices, and the broader lessons for modern DevOps teams.

Database Recovery · Operations · crisis management

0 likes · 15 min read

ITPUB

Feb 26, 2020 · Information Security

What We Learned from the Weimeng Data Deletion Disaster: Backup and Permission Strategies

The article analyzes the recent Weimeng database deletion incident, explains why recovery took 36 hours, and provides practical guidance on backup practices, minimal‑privilege management, and cloud‑based disaster recovery to prevent similar data loss in small and large organizations.

Database Security · Operations · Permission Management

0 likes · 9 min read

Programmer DD

Feb 26, 2020 · Information Security

Inside the Weimob Data Deletion: Lessons on Permissions and Backup

A malicious insider deleted Weimob's primary and backup databases, prompting a slow recovery effort and highlighting the critical need for stricter permission controls and reliable backup mechanisms to prevent similar incidents.

Data loss · Permission Management · backup strategy

0 likes · 5 min read

MaGe Linux Operations

Feb 25, 2020 · Operations

What Weimob’s Data Sabotage Teaches About Robust Ops and Security

On February 25, Weimob disclosed that a core operations employee maliciously destroyed SaaS business data, prompting police involvement and a rapid recovery effort, and the incident underscores the need for comprehensive backup, cloud redundancy, strict access controls, automated deployment, and proactive risk planning.

Operations Management · cloud security · data backup

0 likes · 4 min read

Efficient Ops

Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · Monitoring · Performance Optimization

0 likes · 15 min read

Mafengwo Technology

Feb 8, 2020 · Operations

How a Travel Platform Engineered a Pandemic‑Era Emergency Response: Operations Lessons

During the 2020 Chinese New Year lockdown, a travel platform mobilized its development, product, and operations teams to rapidly build refund systems, coordinate with suppliers, and ensure continuous online services, showcasing a user‑first, cross‑functional emergency strategy that balanced technical delivery with intense customer pressure.

Operations · incident response · pandemic

0 likes · 13 min read

Alibaba Cloud Developer

Dec 20, 2019 · Operations

How We Traced a 48‑Hour Memory Leak in a Distributed Coordination Service

This article details a step‑by‑step investigation of repeated follower process alerts in a Paxos‑based distributed coordination service, revealing a Java GC pause‑induced memory leak in the front‑end Proxy and describing the rapid mitigation actions taken to restore system stability.

Monitoring · distributed systems · incident response

0 likes · 12 min read

Liangxu Linux

Dec 10, 2019 · Information Security

Master Linux Incident Response: Detect, Remove, and Harden Malware Step‑by‑Step

This guide walks you through a complete Linux incident‑response workflow—identifying suspicious behavior, terminating malicious processes, eradicating virus files, closing persistence mechanisms, and hardening the system—while providing concrete shell commands and practical tips for each stage.

Malware Removal · System Hardening · incident response

0 likes · 10 min read

Efficient Ops

Dec 5, 2019 · Information Security

Master Linux Incident Response: Step‑by‑Step Virus Detection and Removal

This guide walks you through a complete Linux emergency response workflow—identifying suspicious behavior, terminating malicious processes, removing infected files, eliminating persistence mechanisms, hardening the system, and adding command auditing—using practical shell commands and examples.

Linux · Malware Removal · Shell Commands

0 likes · 9 min read

Sohu Tech Products

Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations

0 likes · 15 min read

Efficient Ops

Sep 15, 2019 · Operations

Why Ops Needs a Project‑Management Mindset: Lessons from a Simple RAID Change

The article shares practical Ops insights, using a simple RAID change incident to illustrate why operations teams must understand change background, choose optimal timing, act as project managers, and follow a structured change process to protect production environments.

Change Management · Operations · incident response

0 likes · 8 min read

Efficient Ops

Aug 8, 2019 · Operations

10 Ops Murphy’s Laws Every Engineer Should Read Daily

This article shares a set of operational Murphy’s laws, practical process‑management tips, and automation strategies to help ops engineers reduce human error, improve safety, stability, efficiency, and cost‑saving in daily work.

Automation · Operations · incident response

0 likes · 9 min read

21CTO

Jun 17, 2019 · Information Security

How a Hidden gpg-agentd Malware Hijacked SSH and Exploited Redis on a Cloud Server

A detailed forensic walk‑through reveals how a compromised Alibaba Cloud server was seized via a weak root password, a disguised gpg-agentd binary, malicious cron jobs, and Redis configuration abuse, ultimately enabling password‑less SSH access and large‑scale network scanning for cryptocurrency mining.

Malware Analysis · cloud security · incident response

0 likes · 13 min read

ITPUB

May 19, 2019 · Information Security

Uncovering a SQL Server Job That Hid a Persistent Malware Loader

This article details a multi‑stage, file‑less attack that leveraged weak SQL Server credentials, Transact‑SQL stored procedures, and WMI to download and execute a downloader (cabs.exe) which fetched multiple botnet components, and explains the forensic steps and remediation measures taken to eradicate the threat.

Malware · SQL Server · Stored Procedure

0 likes · 7 min read

Efficient Ops

Mar 23, 2019 · Operations

How to Build a Bank Ops SWAT Team for 5‑Minute Incident Recovery

This article explains how a bank can create a specialized Operations SWAT team, define its role, adopt seven essential “weapons” such as layered monitoring, intelligent alerts, communication protocols, automation, and disaster‑recovery tactics, and continuously train the team to meet strict five‑minute recovery targets.

Automation · Disaster Recovery · Monitoring

0 likes · 21 min read

Efficient Ops

Mar 18, 2019 · Operations

How to Build a Bank Ops SWAT Team for Rapid Incident Recovery

This article explains how a bank can create a specialized SWAT‑style operations team, define its roles, adopt seven essential "weapons" such as monitoring and intelligent alerts, and apply ten tactical processes—from communication to automation—to meet strict five‑minute recovery and regulatory requirements.

Automation · Disaster Recovery · Monitoring

0 likes · 21 min read

Efficient Ops

Jan 29, 2019 · Information Security

How Hackers Hijacked a Server with Hidden Accounts and Crypto‑Mining: A Forensic Walkthrough

This article details a multi‑stage server compromise that injected gambling pages, planted hidden accounts, deployed crypto‑mining software, and opened unnecessary ports, providing step‑by‑step forensic analysis, code inspection, emergency response actions, and indicators of compromise.

crypto mining · incident response · information security

0 likes · 12 min read

NetEase Game Operations Platform

Dec 10, 2018 · Information Security

Understanding and Improving Operations Security: Practices, Risks, and Enterprise‑Level Solutions

This article explains the concept of operations security, why it has become critical, enumerates common mis‑configurations and vulnerabilities such as open ports, weak permissions, insecure scripts and supply‑chain risks, and provides a comprehensive set of best‑practice guidelines and an enterprise‑level framework to build a resilient operations security posture.

Automation · incident response · infrastructure

0 likes · 28 min read

MaGe Linux Operations

Nov 19, 2018 · Information Security

How a Redis Hijack Exposed Critical Security Gaps and What to Do

This article recounts a Redis server hijack incident, detailing the detection, forensic investigation, removal of malicious files and cron jobs, and provides practical hardening recommendations to prevent similar attacks on Linux environments.

Linux · Redis · Server Hijack

0 likes · 7 min read

Meituan Technology Team

Nov 8, 2018 · Information Security

Intrusion Detection: Concepts, Challenges, and Best Practices

Effective intrusion detection for large enterprises hinges on combining signature‑based pattern matching with baseline anomaly modeling, gathering comprehensive host and network logs, focusing on the GetShell foothold, managing alert fatigue, and integrating AI‑enhanced feature engineering while maintaining robust operational foundations and continuous expertise development.

AI · Intrusion Detection · Security Operations

0 likes · 31 min read

dbaplus Community

Sep 12, 2018 · Operations

Mastering Enterprise Ops Security: Habits, Architecture, and Incident Response

This article presents a comprehensive guide to operational security, covering essential habits, a layered technical architecture, access‑control strategies, CI/CD safeguards, DDoS mitigation, data protection, incident‑response procedures, and collaboration with IT, security, and network teams.

Access Control · CI/CD security · DDoS Defense

0 likes · 20 min read

Beike Product & Technology

Aug 15, 2018 · Information Security

Malware Incident Response: Analyzing and Removing a Persistent Windows Trojan

This article details a step‑by‑step incident‑response case study of a Windows internal‑network Trojan that exploited SMB port 445, describing how alerts were identified, malicious processes were traced, terminated, and fully removed using tools such as netstat, PChunter, and process monitoring utilities.

Malware Analysis · Network Scanning · Windows security

0 likes · 6 min read

Efficient Ops

Jul 8, 2018 · Operations

How to Seamlessly Take Over New Ops Responsibilities: A Practical Checklist

This guide outlines a step‑by‑step approach for taking over new operational responsibilities, covering communication with development leaders, business overview, asset inventory, basic and business‑specific monitoring, standardization, SOP creation, failure drills, cost and capacity planning, and effective cross‑team communication.

Asset Management · Monitoring · Operations

0 likes · 10 min read

MaGe Linux Operations

Jul 7, 2018 · Operations

How to Seamlessly Take Over a New Service: An Operations Playbook

This guide outlines a step‑by‑step operations playbook for assuming responsibility of a new business service, covering initial communication, asset inventory, monitoring setup, standardization, SOP creation, incident drills, ongoing optimization, and effective cross‑team communication to ensure stable, low‑cost, and high‑quality service delivery.

Asset Management · Monitoring · SOP

0 likes · 9 min read

ITPUB

May 31, 2018 · Information Security

How Attackers Exploit Unauthenticated Redis to Deploy Worms and Mine Cryptocurrency

This article analyzes the recent surge of Redis unauthenticated attacks that install a worm, use pnscan for lateral scanning, modify system settings, and launch cryptocurrency mining, while providing detailed script breakdowns and remediation steps.

Malware Analysis · Redis · cryptomining

0 likes · 16 min read

Efficient Ops

May 14, 2018 · Information Security

How to Investigate and Clean a Hacked Website: Step‑by‑Step Security Guide

This article details a real-world incident response to a compromised website that redirected visitors to a gambling site, covering intrusion analysis, server remediation, log tracing, and comprehensive cleanup steps to restore security.

incident response · log analysis · malware cleanup

0 likes · 14 min read

ITPUB

Apr 21, 2018 · Operations

Essential Ops Checklist: Avoid Disasters with Proven Practices

A seasoned operations engineer shares a comprehensive guide covering online operation standards, data handling, security hardening, daily monitoring, performance tuning, and the right mindset to prevent costly incidents and ensure stable, secure, and efficient production environments.

Monitoring · incident response

0 likes · 14 min read

Java Backend Technology

Apr 2, 2018 · Information Security

How a Hidden Cron Job Hijacked My Server and How I Fixed It

A production server running Tomcat, MySQL, MongoDB and ActiveMQ was taken down by a malicious cron job that executed a cryptomining script, and the article walks through the investigation, removal, and hardening steps to fully recover and secure the system.

Linux Hardening · cron job · cryptomining

0 likes · 4 min read

Alibaba Cloud Infrastructure

Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Monitoring · Operations · Stability

0 likes · 14 min read

Efficient Ops

Nov 30, 2017 · Databases

How a Nighttime Hot‑Key Surge Overwhelmed a Database Server—and the Ops Fixes That Saved It

A DBA on call discovers a sudden traffic spike caused by a few massive hot keys on a storage server, quickly isolates the issue, migrates data, applies throttling and caching, and outlines automation ideas to prevent future overloads, illustrating practical database operations and incident response.

Data Migration · Hot Key · Monitoring

0 likes · 9 min read

21CTO

Nov 2, 2017 · Operations

How to Diagnose and Fix Online System Issues Efficiently

This article shares practical methods for frontline engineers to quickly understand, assess, and resolve online system problems by categorizing system layers, evaluating impact, using essential Linux monitoring tools, and applying systematic troubleshooting and design‑for‑failure strategies to minimize downtime.

Linux tools · Online Debugging · backend operations

0 likes · 11 min read
