Tagged articles

incident response

230 articles · Page 3 of 3

Oct 17, 2017 · Operations

System Troubleshooting: A Structured Approach to Diagnosis, Recovery, and Failure‑Resilient Design

This article presents a systematic methodology for diagnosing and resolving online system issues, covering system understanding, impact assessment, rapid recovery techniques, detailed troubleshooting steps with Linux and Java tools, and design principles to mitigate future failures.

Java debugging · Linux tools · incident response

0 likes · 12 min read

dbaplus Community

Sep 21, 2017 · Information Security

How I Detected and Fixed a Shellshock Attack on a Linux Server

After a sudden server crash, the author traced a ransomware note, uncovered a Bash Shellshock exploit through log analysis and crafted GET requests, verified the vulnerability, upgraded Bash, and applied post‑compromise hardening steps to fully recover the system.

Bash vulnerability · Linux security · Shellshock

0 likes · 11 min read

ITPUB

Aug 13, 2017 · Operations

How I Restored Accidentally Deleted Production Data with ext3grep and MySQL Binlog

I describe how I recovered after a mistaken rm -rf command wiped an entire production server: first with ext3grep and a custom restoration script, then with extundelete attempts, and finally by replaying the MySQL binlog. I also cover the challenges encountered and the lessons learned.

ext3grep · incident response · mysql

0 likes · 9 min read

MaGe Linux Operations

Jul 20, 2017 · Information Security

Essential Linux Security Hardening: From Account Safety to Rootkit Detection

This guide outlines comprehensive Linux security practices for administrators, covering account and login protection, service minimization, password and key authentication, sudo usage, login banner hardening, remote access safeguards, filesystem permissions, rootkit detection tools, and step‑by‑step response procedures after a server compromise.

Linux security · Rootkit Detection · incident response

0 likes · 25 min read

21CTO

Jul 10, 2017 · Operations

How I Rescued a Production MySQL Database After a Fatal rm -rf Disaster

After a mistaken rm -rf command wiped an entire production server—including MySQL data—the author chronicles a step‑by‑step recovery using ext3grep, custom scripts, and binlog restoration, highlighting lessons learned and best practices for future incident handling.

Binlog · Data Recovery · Linux

0 likes · 9 min read

Efficient Ops

Apr 13, 2017 · Information Security

From Traditional Ops to Automated Security: Ctrip’s Journey and Lessons

This article recounts a Ctrip security engineer’s evolution from early Unix‑based operations to fully automated network security, highlighting challenges in forecasting, application security integration, rapid incident response, and large‑scale firewall automation within a fast‑growing enterprise.

Automation · Security Operations · incident response

0 likes · 12 min read

MaGe Linux Operations

Mar 24, 2017 · Information Security

How We Detected and Eliminated a Struts2 Mining Malware Attack

This article recounts a recent incident where a Struts2 vulnerability was exploited to run mining malware, detailing the discovery process, forensic analysis of services, processes, network listeners, and the step‑by‑step remediation measures including script‑based scans, permission hardening, and upgrading Struts2.

Malware · Struts2 · Vulnerability

0 likes · 4 min read

Efficient Ops

Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Monitoring · Operations · capacity planning

0 likes · 18 min read

MaGe Linux Operations

Feb 26, 2017 · Information Security

How We Traced and Stopped a UDP Flood Attack on an Oracle‑Tomcat Server

During the Chinese New Year, a client’s Oracle‑Tomcat server was overwhelmed by massive UDP traffic. The resulting forensic investigation uncovered a hidden Trojan through command‑line analysis, led to iptables hardening, and traced the root cause to a weak SSH password left in place after a hardware upgrade.

Linux forensics · Malware Analysis · incident response

0 likes · 5 min read

Efficient Ops

Feb 20, 2017 · Information Security

Inside YY's Security Ops: Real-World Incident Stories and Architecture

This article shares YY's security operations journey, detailing real incident response scenarios, the evolution of their security infrastructure from 2012 onward, and the key factors considered when building a robust security ops system, including DDoS protection, WAF, vulnerability scanning, intrusion detection, and data‑driven automation.

DDoS protection · Security Operations · big data analytics

0 likes · 24 min read

Efficient Ops

Feb 16, 2017 · Operations

Why a Missed DNS Renewal Shut Down Our Site—and How We Fixed It

A detailed post‑mortem recounts how a forgotten domain renewal caused a DNS outage, the frantic troubleshooting steps across teams, temporary work‑arounds like switching to Google DNS, and the lessons learned for future incident management.

DNS · domain management · incident response

0 likes · 13 min read

Efficient Ops

Feb 2, 2017 · Operations

What Happens When a Production Database Is Accidentally Deleted? Lessons from GitLab’s Disaster

This article recounts the GitLab production database deletion incident, analyzes why backup mechanisms failed, shares technical and cultural lessons on operational practices, and offers concrete recommendations for building resilient, high‑availability systems to prevent data loss.

Operations · backup · incident response

0 likes · 16 min read

21CTO

Feb 2, 2017 · Operations

What GitLab’s 300 GB Data Loss Teaches About Backup and Ops Discipline

The GitLab production database was mistakenly deleted during a manual fix, exposing gaps in backup strategies, PostgreSQL configuration, and operational practices, and prompting a detailed post‑mortem that highlights the need for automated recovery, proper tooling, and transparent incident handling.

Data loss · Operations · PostgreSQL

0 likes · 15 min read

dbaplus Community

Jan 25, 2017 · Information Security

Effective Server Security Incident Response: Step‑by‑Step Guide

When a production server is compromised, abrupt actions like pulling the plug can disrupt services, so this guide outlines an eight‑stage, evidence‑driven response process—including verification, on‑site preservation, containment, impact assessment, online analysis, backup, deep forensics, and reporting—plus real‑world case studies and concrete command examples.

Case Study · Forensics · Linux

0 likes · 14 min read

ITPUB

Jan 17, 2017 · Information Security

How to Diagnose and Eradicate a Linux Trojan That Spikes Outbound Traffic

This article recounts a real‑world incident on an Ubuntu 12.04 server where massive outbound traffic was traced to a hidden trojan, detailing step‑by‑step investigation, identification of malicious processes, removal techniques, and preventive hardening measures.

Rootkit · incident response · iptables

0 likes · 9 min read

Nightwalker Tech

Nov 9, 2016 · Operations

Best Practices for Service Monitoring and Alerting in E‑commerce Systems

The discussion outlines essential service‑monitoring techniques for reliably detecting and responding to incidents in e‑commerce environments: health checks, JVM metrics, period‑over‑period comparison of traffic and payment volumes, client‑side exception tracking, third‑party CDN monitoring, alert thresholds, instrumentation via AOP or SDKs, and tooling such as Datadog, Zabbix, and the Elastic stack.

Alerting · Logging · Monitoring

0 likes · 10 min read

360 Zhihui Cloud Developer

Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Monitoring · Operations · incident response

0 likes · 5 min read

ITPUB

Sep 21, 2016 · Operations

How Tencent Cloud Recovered Lost Data in a 2‑Day Storage Crisis

In a two‑day incident, Tencent Cloud's CBS team diagnosed cell failures, implemented directed reads and a dual‑cell merge strategy, and restored three‑copy data integrity, while uncovering monitoring gaps and tool limitations that inform future storage operations.

Data Recovery · Distributed storage · cloud storage

0 likes · 9 min read

MaGe Linux Operations

Sep 20, 2016 · Information Security

Step‑by‑Step Guide to Investigating a Compromised Linux System

Learn how to systematically examine a Linux host for compromise by checking logs, user accounts, processes, file integrity, network activity, scheduled tasks, services, and rootkits, using built‑in commands and RPM verification to uncover hidden threats.

Forensics · Linux · Rootkit Detection

0 likes · 5 min read

Efficient Ops

Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Automation · Operations · Risk Management

0 likes · 21 min read

Efficient Ops

May 30, 2016 · Information Security

Why Weak Passwords and Unpatched Redis Threaten Operational Security

The article explains how weak passwords, misconfigured services like Redis, careless port changes, and leaked data enable attackers to compromise servers and internal networks, illustrating each risk with real‑world case studies and offering practical mitigation advice for robust ops security.

Redis vulnerability · data breach · incident response

0 likes · 11 min read

dbaplus Community

May 27, 2016 · Databases

How to Keep DBA Operations Error‑Free: 5 Essential Practices

This article shares practical DBA advice—pre‑operation preparation, thorough fault analysis, effective communication, mandatory backups, and post‑incident reviews—to help database administrators maintain stability and avoid costly mistakes during online operations.

DBA · best practices · incident response

0 likes · 7 min read

MaGe Linux Operations

Apr 29, 2016 · Information Security

How to Analyze and Recover from a Linux Rootkit Intrusion

This article walks through a real-world Linux server compromise, detailing the attack symptoms, forensic analysis steps, rootkit discovery, exploitation of an Awstats script vulnerability, and practical remediation measures to restore and harden the affected system.

Awstats · Forensics · Linux

0 likes · 14 min read

Efficient Ops

Mar 8, 2016 · Information Security

How to Build an Effective Information Security Response Plan Before a Breach

This article outlines why proactive information‑security preparedness, cross‑department response teams, and clear incident‑response checklists are essential for minimizing damage and maintaining trust when a data breach occurs.

Risk Management · Security Operations · data breach

0 likes · 14 min read

Efficient Ops

Dec 16, 2015 · Operations

Mastering Ops Team Communication and Process Standards for Effective Management

This article outlines practical communication techniques, environment choices, active listening, and emotional control for operations teams, then details how to define standards, build robust processes, and ensure reliable business continuity through clear responsibilities, monitoring, automation, and continuous improvement.

Process Standards · incident response · operations best practices

0 likes · 11 min read

Java High-Performance Architecture

Dec 10, 2015 · Information Security

How a Compromised Alibaba Cloud Server Was Recovered: Key Security Lessons

A sudden shutdown of an Alibaba Cloud server revealed it had been hijacked as a bot, prompting a step‑by‑step remediation that highlights the importance of basic security hardening such as securing Redis, changing default SSH settings, and enabling cloud monitoring alerts.

Alibaba Cloud · incident response · information security

0 likes · 3 min read

ITPUB

Nov 16, 2015 · Information Security

5 Hidden Signs Your Web Application Is Compromised and How to Respond

The article outlines five subtle indicators of a web application breach—abnormal behavior, irregular logs, unexpected processes or users, file modifications, and warning messages—while offering practical monitoring and remediation steps to help security teams detect and mitigate attacks early.

application monitoring · incident response · log analysis

0 likes · 7 min read

Efficient Ops

Jul 6, 2015 · Operations

How to Tame “Thorny” Employees Without Undermining Your Ops Team

This article explores the characteristics, causes, and practical strategies for managing difficult or “thorny” team members in operations, offering case studies and step‑by‑step recommendations to mitigate risks while maintaining team performance.

Leadership · employee performance · incident response

0 likes · 9 min read

MaGe Linux Operations

Apr 24, 2015 · Operations

10 Proven Fault Management Practices Every Ops Team Should Master

This guide shares ten practical fault‑management techniques—ranging from proactive attitude and prioritizing incidents to continuous follow‑up and team collaboration—to help operations teams reduce damage, maintain service reliability, and keep users engaged during outages.

Operations · best practices · fault management

0 likes · 8 min read

MaGe Linux Operations

Aug 19, 2014 · Operations

Essential Linux Security Checklist: 11 Steps to Detect Compromise

This guide provides a comprehensive 11‑step Linux security inspection checklist, covering account verification, log analysis, process and file checks, package integrity, network monitoring, scheduled tasks, backdoor detection, kernel modules, services, and rootkit scanning to help identify system compromises.

Linux commands · Operations · incident response

0 likes · 5 min read
