Tagged articles
222 articles
Huolala Safety Emergency Response Center
Dec 2, 2022 · Information Security

How to Detect, Contain, and Eradicate the DarkKomet RAT: A Full Incident Response Walkthrough

This article provides a step‑by‑step technical analysis of the DarkKomet remote‑access trojan, covering its capabilities, infection vectors, detection methods using TTP‑driven EDR, containment actions, eradication procedures, root‑cause forensics, and post‑incident recovery measures.

DarkKomet · EDR · Forensics
0 likes · 9 min read
ITPUB
Nov 27, 2022 · Operations

How to Build an Automated Fault‑Self‑Healing System for Large‑Scale Operations

This article explains why nightly disk‑space alerts demand automated fault‑self‑healing, outlines the necessary process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform with CMDB integration, and provides practical options for script execution and result notification.

CMDB · DevOps · Operations Automation
0 likes · 9 min read
Architects Research Society
Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability
0 likes · 18 min read
JavaEdge
Sep 5, 2022 · Operations

Engineering Double‑11‑Scale E‑Commerce Events: A Complete Technical Playbook

From kickoff meetings and traffic forecasting to load‑testing strategies, rate‑limiting designs, emergency runbooks, and post‑event retrospectives, this guide walks engineers through the complete technical workflow required to ensure a Double‑11‑scale e‑commerce promotion runs smoothly and safely.

Load Testing · Traffic Engineering · incident response
0 likes · 12 min read
Liangxu Linux
Aug 21, 2022 · Information Security

Master Linux Incident Response: Detect, Remove, and Harden Malware Infections

This guide walks you through a complete Linux incident‑response workflow—identifying suspicious behavior, locating and terminating malicious processes, eliminating virus files, closing persistence mechanisms, and hardening the system to prevent future compromises—using practical shell commands and real‑world examples.

Linux · Malware Removal · Security
0 likes · 9 min read
High Availability Architecture
Jul 12, 2022 · Operations

Postmortem of the July 13, 2021 Bilibili SLB Outage: Timeline, Root Cause, and Improvement Measures

This article details the July 13, 2021 Bilibili service outage caused by a Lua‑based SLB CPU spike, describing the incident timeline, root‑cause analysis of a weight‑zero bug, mitigation steps including new SLB deployment, and the subsequent operational and architectural improvements.

Load Balancer · Lua · Root Cause Analysis
0 likes · 17 min read
Bilibili Tech
Jul 12, 2022 · Operations

Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili’s L7 SLB crashed when a recent Lua deployment set a balancer weight to the string “0”, producing a NaN value that triggered an infinite loop and 100% CPU usage. The response involved emergency restarts and a fresh cluster rollout, followed by long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.

Load Balancer · Root Cause Analysis · SLB
0 likes · 17 min read
Dada Group Technology
Jun 20, 2022 · Information Security

Design and Implementation of JD Daojia Security Operations Center (SOC) Platform

This article details the challenges, design choices, deployment steps, detection model creation, data processing, visualization, and future plans of JD Daojia's security operations platform, highlighting the use of Graylog, Elasticsearch, and MongoDB to achieve scalable, real‑time threat detection and response.

Data visualization · Graylog · SOC
0 likes · 16 min read
Continuous Delivery 2.0
Jun 17, 2022 · Operations

Addressing SRE Overload: Causes and Mitigation Strategies

The article examines why SRE teams experience overload due to high incident response demands, analyzes contributing factors such as production issues, alert volume, and manual processes, and proposes comprehensive mitigation steps including better testing, load management, and proactive error detection to reduce on‑call burden.

SRE · incident response · overload
0 likes · 5 min read
Open Source Linux
Jun 1, 2022 · Information Security

How a SpringBoot Server Was Hijacked for Crypto Mining and What You Can Do

This article chronicles the discovery of a server breach used for cryptocurrency mining, analyzes the malicious Python payload and its system modifications, and provides concrete remediation steps such as system reinstall, non‑root deployment, firewall hardening, and Nginx authentication.

Cryptocurrency Mining · Server Security · SpringBoot
0 likes · 12 min read
Architecture and Beyond
May 21, 2022 · Product Management

Mastering Product Prioritization: From Requirement Levels to Incident Management

This article explains how limited resources shape product requirement prioritization, test‑bug grading, product‑module classification, online bug severity, and incident response levels, offering practical frameworks and concrete grading tables to help teams make objective, value‑driven decisions throughout a product’s lifecycle.

Operations · bug triage · incident response
0 likes · 13 min read
DeWu Technology
May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three‑tier NOC‑SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS‑linked workflow, boosting detection frequency, cutting critical response times to under five minutes, and improving overall system reliability while emphasizing ongoing baseline and rule maintenance.

Alert Management · NOC · Operations
0 likes · 13 min read
Bilibili Tech
Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Oncall · Operations
0 likes · 23 min read
DaTaobao Tech
Apr 20, 2022 · Operations

Understanding Wireless Operations and Maintenance: Origins, Challenges, and Future Directions

Wireless operations and maintenance (O&M) evolved from backend‑focused practices to address stability and performance of mobile‑device services, tackling low issue detection rates and delayed responses through improved monitoring, gray‑release tagging, phased rollouts, AI‑driven diagnostics, and automated release gates, while inviting collaborative development.

gray release · incident response · mobile maintenance
0 likes · 13 min read
21CTO
Mar 31, 2022 · Operations

What Caused the Biggest 2021 Outages? Lessons from Bilibili, Facebook, AWS, and More

The article reviews ten major 2021 service outages—from Chinese platforms like Bilibili and Futu to global giants such as Facebook, Roblox, and AWS—analyzing their root causes, redundancy failures, and the operational lessons needed to prevent future black‑swans.

high availability · incident response · outage analysis
0 likes · 15 min read
Beike Product & Technology
Feb 18, 2022 · Operations

KeMonitor Alert Platform: Systematic Alert Governance and Practices

The article presents a comprehensive case study of KeMonitor, a one‑stop monitoring and alert platform built by Beike Zhaofang (贝壳找房) to unify fragmented alerts, define lifecycle‑based governance, standardize alert metadata, and implement graded subscription, on‑call escalation, silencing, self‑healing, and post‑mortem analysis, thereby improving incident response efficiency and reducing alert fatigue.

Alerting · SOP · incident response
0 likes · 17 min read
Programmer DD
Feb 8, 2022 · Operations

What Triggered the Biggest Internet Outages of 2021? Lessons from 10 Major Incidents

A comprehensive review of ten major 2021 internet outages—from domestic platforms like Bilibili and Futu to global services such as Facebook, Roblox, and AWS—examines their root causes, the role of infrastructure design, and the operational lessons needed to improve system resilience.

cloud infrastructure · incident response · outage analysis
0 likes · 16 min read
IT Services Circle
Feb 2, 2022 · Operations

Huawei Cloud’s New Year Defense: How SRE Teams Counter Massive Attacks

Huawei Cloud’s internal “blue‑team” launched over twenty coordinated attacks around Chinese New Year, but the company’s SRE “red‑team” and a dedicated 24/7 “special forces” unit detected, isolated, and resolved incidents within minutes, keeping failure rates below 0.01% and demonstrating advanced cloud operations and security practices.

SRE · incident response
0 likes · 9 min read
Efficient Ops
Nov 26, 2021 · Information Security

How a Misconfigured Kubelet Led to a Crypto‑Mining Breach and What to Do

A self‑built Kubernetes cluster suffered a crypto‑mining intrusion due to empty iptables and a misconfigured kubelet, prompting a detailed post‑mortem that outlines the symptoms, root‑cause analysis, and practical hardening steps to protect similar environments.

crypto mining · firewall · incident response
0 likes · 5 min read
Open Source Linux
Nov 25, 2021 · Information Security

Master Linux Incident Response: Step-by-Step Virus Detection and Removal

This guide walks through a four‑stage Linux incident‑response workflow—identifying symptoms, killing malicious processes, closing persistence mechanisms, and hardening the system—while providing the exact shell commands needed to detect and eradicate Linux malware.

Linux · Malware Removal · Shell Commands
0 likes · 6 min read
Open Source Linux
Nov 8, 2021 · Information Security

Essential Linux Incident Response Commands for Quick Security Investigations

This guide outlines the typical Linux and Windows environments encountered in security incidents, common threats such as mining and ransomware, and provides a step‑by‑step workflow with essential commands for process, user, network, and file investigation to identify and remediate compromises.

File Analysis · Linux · Security
0 likes · 8 min read
Java High-Performance Architecture
Oct 20, 2021 · Information Security

How a Misconfigured Kubelet Led to Crypto Mining on Our Kubernetes Node – Lessons Learned

After discovering a suspicious process on one of our self‑built Kubernetes nodes, we traced the intrusion to a misconfigured kubelet that exposed the API, allowing attackers to run a Monero mining script, and we outline the investigation steps and hardening measures to prevent similar breaches.

Kubernetes · Security · crypto mining
0 likes · 6 min read
Java Architect Essentials
Aug 22, 2021 · Information Security

Former Hospital Network Administrator Carries Out Revenge Attack, Crippling Xi'an Hospital's Diagnostic Systems

In Xi'an's Lianhu District, a disgruntled former network administrator exploited self‑taught networking skills to illegally infiltrate a hospital's internal servers, remotely executing destructive operations that deleted critical files, disabled printers, CT and ultrasound machines, and ultimately caused the entire medical information system to collapse, prompting a police investigation that led to his arrest and criminal detention for damaging computer information systems.

cybersecurity · hospital · incident response
0 likes · 5 min read
DevOps
Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection · Operations · Reliability
0 likes · 11 min read
Efficient Ops
Jun 1, 2021 · Operations

Mastering System Stability: Building a Chaos‑Driven Platform for Financial Ops

This article details how a major securities firm analyzed business stability, built a comprehensive stability engineering platform using chaos engineering, practiced extensive fault‑injection drills, and outlines future directions such as random‑scenario exercises, red‑blue battles, and AI‑driven risk detection.

Operations · chaos engineering · financial systems
0 likes · 11 min read
Alibaba Cloud Developer
May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · Team Roles
0 likes · 10 min read
Liangxu Linux
Apr 21, 2021 · Information Security

Essential Linux Incident‑Response Commands for Quick Threat Detection

This guide walks through common Linux emergency scenarios—such as mining malware, ransomware, and backdoors—detailing a step‑by‑step workflow and providing essential command‑line tools for process, user, network, and file investigation on CentOS 6 and Windows Server 2008 systems.

Forensics · Linux · Security
0 likes · 11 min read
Alibaba Cloud Developer
Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response
0 likes · 21 min read
Liangxu Linux
Feb 25, 2021 · Information Security

How to Automate Linux Incident Response and Analyze a Mining Malware

This article shares a step‑by‑step Linux incident‑response workflow, including an automated Bash information‑gathering script, analysis of malicious cron jobs and a 439‑line mining malware, its SSH‑based lateral spread, and practical cleanup procedures with a reusable toolbox on GitHub.

Bash Automation · Cron Jobs · Cryptocurrency Mining
0 likes · 13 min read
Efficient Ops
Jan 13, 2021 · Information Security

How to Detect and Eradicate a Hidden Linux Mining Botnet: A Step‑by‑Step Analysis

This article walks through a real‑world Linux mining malware infection, detailing how the attacker hid a malicious cron job, used LD_PRELOAD rootkits, propagated via SSH keys, and how the analyst uncovered and removed the threat using busybox, strace, and careful forensic commands.

Cryptocurrency Mining · incident response · malware analysis
0 likes · 12 min read
iQIYI Technical Product Team
Jan 8, 2021 · Information Security

SOAR (Security Orchestration, Automation and Response) Implementation at iQIYI: Architecture, Scenarios, and Roadmap

iQIYI’s SOAR platform, built on StackStorm and the Walkoff visual editor, integrates security components, scripts, chat‑ops bots, and a mini‑program to automate detection and response, cutting MTTR by roughly 75% across high‑frequency routine tasks and low‑frequency critical incidents while planning broader coverage and knowledge‑base expansion.

SOAR · Security Operations · StackStorm
0 likes · 8 min read
Yanxuan Tech Team
Dec 14, 2020 · Operations

Mastering Stability Governance: Practical Strategies for Reliable Supply‑Chain Systems

This article examines the critical role of stability governance in evolving systems, outlines a three‑stage framework—usability, monitoring alerts, and online emergency—illustrated with a case study of an electronic waybill service, and shares concrete strategies for prevention, detection, response, and post‑mortem to achieve predictable, observable, and fast‑acting reliability.

Operations · governance · incident response
0 likes · 11 min read
NetEase Yanxuan Technology Product Team
Dec 11, 2020 · Operations

How to Build Effective Stability Governance for E‑commerce Logistics Services

This article analyzes the concept of stability governance, outlines its five fault‑management sub‑domains (prevention, perception, reach, mitigation, and post‑mortem), examines the pain points of an electronic waybill service, and presents a comprehensive strategy backed by concrete implementation steps in three areas: availability, monitoring, and online emergency handling.

Logistics · Operations · incident response
0 likes · 12 min read
21CTO
Dec 10, 2020 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Incident Response

This article explains how Netflix built the Telltale monitoring system to consolidate data sources, provide multidimensional health assessments, deliver intelligent alerts, and streamline incident management for over 100 production applications, reducing on‑call fatigue and improving service reliability.

Netflix · Observability · incident response
0 likes · 14 min read
macrozheng
Nov 26, 2020 · Information Security

Recovering a Server Hijacked by a Crypto‑Mining Virus: My Step‑by‑Step Fix

After my small 1‑CPU, 2 GB server was compromised by a crypto‑mining virus that hijacked SSH access, I used VNC to investigate, identified malicious processes, traced infected files, removed cron jobs, restored system utilities, repaired SELinux, and closed the Redis vulnerability to fully recover the machine.

Linux · Redis vulnerability · SSH
0 likes · 10 min read
Taobao Frontend Technology
Nov 23, 2020 · Operations

Achieving 1‑5‑10 Front‑End Monitoring with JSTracker for Double‑11

This article explains how the JSTracker platform was used to build a comprehensive end‑to‑end front‑end monitoring and data analysis solution that meets the 1‑5‑10 safety production goal—detecting issues within one minute, locating them in five, and fixing them in ten—by improving coverage, subscription, metrics, and gray‑release monitoring for Alibaba’s Double‑11 promotion.

Operations · gray release · incident response
0 likes · 15 min read
JD Cloud Developers
Nov 13, 2020 · Information Security

How JD Cloud Secures Massive E‑Commerce Events: A Multi‑Layered Defense Blueprint

This article explains how JD Cloud builds a multi‑layered, end‑to‑end security architecture—including five defense layers, four implementation stages, three guiding principles, and two key focuses—to protect high‑traffic e‑commerce events such as 618 and 11.11 from attacks and ensure stable, safe operations.

cloud security · e‑commerce · incident response
0 likes · 15 min read
Efficient Ops
Oct 27, 2020 · Information Security

How to Detect Account Security Threats Using Log Analysis and Alerts

This article explains practical methods for detecting account security threats—such as blacklisted, expired, or abnormal login behaviors—by analyzing Linux and Windows login logs, defining detection rules, and leveraging automated tools to generate timely alerts and reduce security risks.

Threat Detection · account security · incident response
0 likes · 27 min read
Java Backend Technology
Oct 22, 2020 · Information Security

What Caused the Massive P1 Outage? A Real‑World Security Scanning Bug Uncovered

A sudden P1 incident reset all user passwords, and after a thorough investigation the team discovered that a security‑scanning tool’s weak‑password check repeatedly hit login attempts, triggering a bug that caused the outage, highlighting the critical need for proper incident response and security engineering.

Operations · P1 incident · database
0 likes · 7 min read
Liangxu Linux
Oct 6, 2020 · Information Security

How I Uncovered a Phishing Mooncake Email Using Wireshark, Shodan, and OSINT

During the Mid‑Autumn Festival I received a seemingly harmless mooncake email, suspected it was a phishing test, and then used a virtual machine, network‑capture tools, Shodan, and open‑source intelligence to trace the malicious link back to its source and expose the underlying infrastructure.

Network Reconnaissance · OSINT · Shodan
0 likes · 4 min read
Efficient Ops
Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Observability · Operations · SRE
0 likes · 17 min read
DevOps
Aug 19, 2020 · Operations

DevOps Lessons from the Knight Capital Group Collapse: A Case Study

The article analyzes the 2012 Knight Capital Group disaster, showing how a manual deployment error, lingering legacy code, a missing kill‑switch, and inadequate monitoring caused a loss of roughly $460 million within 45 minutes, and extracts key DevOps best‑practice lessons to prevent similar failures.

Deployment · DevOps · Technical Debt
0 likes · 13 min read
Qunar Tech Salon
Jul 27, 2020 · Operations

Website Operations at Qunar: Ensuring Stability, Security, and Efficiency During the Pandemic

The interview with Sun Bin, head of Qunar's website operations, explains how the team acted as a specialized "BlackOps" unit to provide robust technical guarantees, automate problem‑solving, protect user data, optimize resources, and maintain continuous service during the COVID‑19 outbreak.

Cost Optimization · Pandemic Response · cloud infrastructure
0 likes · 10 min read
Tencent Cloud Developer
May 14, 2020 · Operations

Tencent Classroom Monitoring Practices: Challenges, Strategies, and Future Directions

During the pandemic’s “classes suspended, learning continues” (停课不停学) surge, Tencent Classroom tackled a 120‑fold traffic jump by rapidly deploying Grafana dashboards, Kibana logs, internal Moniter and cloud monitoring tools, and by establishing a three‑layer feedback‑alert‑on‑call model. It now plans automation, unified visualizations, and chaos engineering to further boost observability and service reliability.

DevOps · SRE · Tencent Classroom
0 likes · 14 min read
dbaplus Community
Apr 20, 2020 · Operations

Preventing Database Disasters: Key Lessons from the Zhengda Hospital Outage

The Zhengda Hospital HIS database outage, caused by unauthorized scripts and poor permission controls, sparked a detailed discussion on how to prevent reckless production testing, enforce proper authorization, design efficient yet secure workflows, improve outsourcing oversight, and build robust emergency and compliance practices.

Database operations · DevOps · Production Security
0 likes · 12 min read
Java Backend Technology
Mar 5, 2020 · Operations

How a Massive Delete-Database Crisis at Weimeng Reveals Key Ops Lessons

On Feb 23, Weimeng suffered a large‑scale system outage caused by a core operations staff member mistakenly deleting production databases, prompting a multi‑day recovery effort with Tencent Cloud support; the article examines the incident’s background, historical parallels, crisis response, and broader operational insights for DevOps and reliability engineering.

Database Recovery · DevOps · Operations
0 likes · 16 min read
ITPUB
Feb 29, 2020 · Information Security

What the Weimeng Database Deletion Reveals About Backup and Permission Strategies

The article analyzes the recent Weimeng data‑loss incident, explains why recovery took 36 hours, highlights insider abuse, and offers a practical guide for small and large teams covering reliable backups, minimal‑privilege management, and cloud‑based disaster‑recovery solutions.

Database Security · Privilege Management · backup strategy
0 likes · 9 min read
Efficient Ops
Feb 26, 2020 · Operations

What the Weimeng Delete‑Database Outage Teaches About Modern Ops

After a core operations staff member accidentally deleted Weimeng’s production database in February, the platform endured a multi‑day outage, prompting a transparent crisis response, extensive Tencent Cloud support, and a deep analysis of recovery challenges, operational best practices, and the broader lessons for modern DevOps teams.

Database Recovery · Operations · crisis management
0 likes · 15 min read
ITPUB
Feb 26, 2020 · Information Security

What We Learned from the Weimeng Data Deletion Disaster: Backup and Permission Strategies

The article analyzes the recent Weimeng database deletion incident, explains why recovery took 36 hours, and provides practical guidance on backup practices, minimal‑privilege management, and cloud‑based disaster recovery to prevent similar data loss in small and large organizations.

Backup · Database Security · Operations
0 likes · 9 min read
Programmer DD
Feb 26, 2020 · Information Security

Inside the Weimob Data Deletion: Lessons on Permissions and Backup

A malicious insider deleted Weimob's primary and backup databases, prompting a slow recovery effort and highlighting the critical need for stricter permission controls and reliable backup mechanisms to prevent similar incidents.

Data loss · backup strategy · incident response
0 likes · 5 min read
MaGe Linux Operations
Feb 25, 2020 · Operations

What Weimob’s Data Sabotage Teaches About Robust Ops and Security

On February 25, Weimob disclosed that a core operations employee maliciously destroyed SaaS business data, prompting police involvement and a rapid recovery effort, and the incident underscores the need for comprehensive backup, cloud redundancy, strict access controls, automated deployment, and proactive risk planning.

Kubernetes · Operations Management · cloud security
0 likes · 4 min read
Efficient Ops
Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · Performance Optimization · capacity planning
0 likes · 15 min read
Mafengwo Technology
Feb 8, 2020 · Operations

How a Travel Platform Engineered a Pandemic‑Era Emergency Response: Operations Lessons

During the 2020 Chinese New Year lockdown, a travel platform mobilized its development, product, and operations teams to rapidly build refund systems, coordinate with suppliers, and ensure continuous online services, showcasing a user‑first, cross‑functional emergency strategy that balanced technical delivery with intense customer pressure.

Operations · Software Engineering · incident response
0 likes · 13 min read
Liangxu Linux
Dec 10, 2019 · Information Security

Master Linux Incident Response: Detect, Remove, and Harden Malware Step‑by‑Step

This guide walks you through a complete Linux incident‑response workflow—identifying suspicious behavior, terminating malicious processes, eradicating virus files, closing persistence mechanisms, and hardening the system—while providing concrete shell commands and practical tips for each stage.

Malware Removal · Security · System Hardening
0 likes · 10 min read
Efficient Ops
Dec 5, 2019 · Information Security

Master Linux Incident Response: Step‑by‑Step Virus Detection and Removal

This guide walks you through a complete Linux emergency response workflow—identifying suspicious behavior, terminating malicious processes, removing infected files, eliminating persistence mechanisms, hardening the system, and adding command auditing—using practical shell commands and examples.

Linux · Malware Removal · Security
0 likes · 9 min read
Sohu Tech Products
Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations
0 likes · 15 min read
Efficient Ops
Aug 8, 2019 · Operations

10 Ops Murphy’s Laws Every Engineer Should Read Daily

This article shares a set of operational Murphy’s laws, practical process‑management tips, and automation strategies to help ops engineers reduce human error and improve safety, stability, efficiency, and cost savings in daily work.

Automation · Operations · incident response
0 likes · 9 min read
21CTO
Jun 17, 2019 · Information Security

How a Hidden gpg-agentd Malware Hijacked SSH and Exploited Redis on a Cloud Server

A detailed forensic walk‑through reveals how a compromised Alibaba Cloud server was seized via a weak root password, a disguised gpg-agentd binary, malicious cron jobs, and Redis configuration abuse, ultimately enabling password‑less SSH access and large‑scale network scanning for cryptocurrency mining.

cloud security · incident response · malware analysis
0 likes · 13 min read
ITPUB
May 19, 2019 · Information Security

Uncovering a SQL Server Job That Hid a Persistent Malware Loader

This article details a multi‑stage, file‑less attack that leveraged weak SQL Server credentials, Transact‑SQL stored procedures, and WMI to download and execute a downloader (cabs.exe) which fetched multiple botnet components, and explains the forensic steps and remediation measures taken to eradicate the threat.

SQL Server · Stored Procedure · WMI
0 likes · 7 min read
Efficient Ops
Mar 23, 2019 · Operations

How to Build a Bank Ops SWAT Team for 5‑Minute Incident Recovery

This article explains how a bank can create a specialized Operations SWAT team, define its role, adopt seven essential “weapons” such as layered monitoring, intelligent alerts, communication protocols, automation, and disaster‑recovery tactics, and continuously train the team to meet strict five‑minute recovery targets.

Automation · SWAT team · bank operations
0 likes · 21 min read
Efficient Ops
Mar 18, 2019 · Operations

How to Build a Bank Ops SWAT Team for Rapid Incident Recovery

This article explains how a bank can create a specialized SWAT‑style operations team, define its roles, adopt seven essential "weapons" such as monitoring and intelligent alerts, and apply ten tactical processes—from communication to automation—to meet strict five‑minute recovery and regulatory requirements.

Automation · SWAT team · bank operations
0 likes · 21 min read
Efficient Ops
Jan 29, 2019 · Information Security

How Hackers Hijacked a Server with Hidden Accounts and Crypto‑Mining: A Forensic Walkthrough

This article details a multi‑stage server compromise that injected gambling pages, planted hidden accounts, deployed crypto‑mining software, and opened unnecessary ports, providing step‑by‑step forensic analysis, code inspection, emergency response actions, and indicators of compromise.

crypto mining · incident response · information security
0 likes · 12 min read
NetEase Game Operations Platform
Dec 10, 2018 · Information Security

Understanding and Improving Operations Security: Practices, Risks, and Enterprise‑Level Solutions

This article explains the concept of operations security, why it has become critical, enumerates common mis‑configurations and vulnerabilities such as open ports, weak permissions, insecure scripts and supply‑chain risks, and provides a comprehensive set of best‑practice guidelines and an enterprise‑level framework to build a resilient operations security posture.

Automation · Infrastructure · incident response
0 likes · 28 min read
Meituan Technology Team
Nov 8, 2018 · Information Security

Intrusion Detection: Concepts, Challenges, and Best Practices

Effective intrusion detection for large enterprises hinges on combining signature‑based pattern matching with baseline anomaly modeling, gathering comprehensive host and network logs, focusing on the GetShell foothold, managing alert fatigue, and integrating AI‑enhanced feature engineering while maintaining robust operational foundations and continuous expertise development.

AI · Security Operations · cybersecurity
0 likes · 31 min read
dbaplus Community
Sep 12, 2018 · Operations

Mastering Enterprise Ops Security: Habits, Architecture, and Incident Response

This article presents a comprehensive guide to operational security, covering essential habits, a layered technical architecture, access‑control strategies, CI/CD safeguards, DDoS mitigation, data protection, incident‑response procedures, and collaboration with IT, security, and network teams.

CI/CD security · DDoS Defense · Data Protection
0 likes · 20 min read
Beike Product & Technology
Aug 15, 2018 · Information Security

Malware Incident Response: Analyzing and Removing a Persistent Windows Trojan

This article details a step‑by‑step incident‑response case study of a Windows internal‑network Trojan that exploited SMB port 445, describing how alerts were identified, malicious processes were traced, terminated, and fully removed using tools such as netstat, PChunter, and process monitoring utilities.

Network Scanning · Windows security · incident response
0 likes · 6 min read
Efficient Ops
Jul 8, 2018 · Operations

How to Seamlessly Take Over New Ops Responsibilities: A Practical Checklist

This guide outlines a step‑by‑step approach for taking over new operational responsibilities, covering communication with development leaders, business overview, asset inventory, basic and business‑specific monitoring, standardization, SOP creation, failure drills, cost and capacity planning, and effective cross‑team communication.

Operations · asset management · handovers
0 likes · 10 min read
MaGe Linux Operations
Jul 7, 2018 · Operations

How to Seamlessly Take Over a New Service: An Operations Playbook

This guide outlines a step‑by‑step operations playbook for assuming responsibility of a new business service, covering initial communication, asset inventory, monitoring setup, standardization, SOP creation, incident drills, ongoing optimization, and effective cross‑team communication to ensure stable, low‑cost, and high‑quality service delivery.

SOP · asset management · incident response
0 likes · 9 min read
ITPUB
Apr 21, 2018 · Operations

Essential Ops Checklist: Avoid Disasters with Proven Practices

A seasoned operations engineer shares a comprehensive guide covering online operation standards, data handling, security hardening, daily monitoring, performance tuning, and the right mindset to prevent costly incidents and ensure stable, secure, and efficient production environments.

incident response · monitoring
0 likes · 14 min read
Java Backend Technology
Apr 2, 2018 · Information Security

How a Hidden Cron Job Hijacked My Server and How I Fixed It

A production server running Tomcat, MySQL, MongoDB and ActiveMQ was taken down by a malicious cron job that executed a cryptomining script, and the article walks through the investigation, removal, and hardening steps to fully recover and secure the system.

Linux Hardening · Server Security · cron job
0 likes · 4 min read
Alibaba Cloud Infrastructure
Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Operations · big-data · incident response
0 likes · 14 min read
21CTO
Nov 2, 2017 · Operations

How to Diagnose and Fix Online System Issues Efficiently

This article shares practical methods for frontline engineers to quickly understand, assess, and resolve online system problems by categorizing system layers, evaluating impact, using essential Linux monitoring tools, and applying systematic troubleshooting and design‑for‑failure strategies to minimize downtime.

Linux tools · Online Debugging · Performance Monitoring
0 likes · 11 min read
dbaplus Community
Sep 21, 2017 · Information Security

How I Detected and Fixed a Shellshock Attack on a Linux Server

After a sudden server crash, the author traced a ransomware note, uncovered a Bash Shellshock exploit through log analysis and crafted GET requests, verified the vulnerability, upgraded Bash, and applied post‑compromise hardening steps to fully recover the system.

Bash vulnerability · Linux security · Server Hardening
0 likes · 11 min read
MaGe Linux Operations
Jul 20, 2017 · Information Security

Essential Linux Security Hardening: From Account Safety to Rootkit Detection

This guide outlines comprehensive Linux security practices for administrators, covering account and login protection, service minimization, password and key authentication, sudo usage, system welcome message hardening, remote access safeguards, filesystem permissions, rootkit detection tools, and step‑by‑step response procedures after a server compromise.

Linux security · Rootkit Detection · incident response
0 likes · 25 min read
21CTO
Jul 10, 2017 · Operations

How I Rescued a Production MySQL Database After a Fatal rm -rf Disaster

After a mistaken rm -rf command wiped an entire production server—including MySQL data—the author chronicles a step‑by‑step recovery using ext3grep, custom scripts, and binlog restoration, highlighting lessons learned and best practices for future incident handling.

Backup · Binlog · Data Recovery
0 likes · 9 min read
Efficient Ops
Apr 13, 2017 · Information Security

From Traditional Ops to Automated Security: Ctrip’s Journey and Lessons

This article recounts a Ctrip security engineer’s evolution from early Unix‑based operations to fully automated network security, highlighting challenges in forecasting, application security integration, rapid incident response, and large‑scale firewall automation within a fast‑growing enterprise.

Automation · Security Operations · incident response
0 likes · 12 min read
MaGe Linux Operations
Mar 24, 2017 · Information Security

How We Detected and Eliminated a Struts2 Mining Malware Attack

This article recounts a recent incident where a Struts2 vulnerability was exploited to run mining malware, detailing the discovery process, forensic analysis of services, processes, network listeners, and the step‑by‑step remediation measures including script‑based scans, permission hardening, and upgrading Struts2.

Struts2 · Vulnerability · incident response
0 likes · 4 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e‑commerce
0 likes · 18 min read