When a Single Cable Crashes a Network: Real Ops Incident Lessons
This article recounts two real‑world operations incidents—a network outage caused by an improperly configured portfast on a trunk link and an NFS failure that crippled an API service—then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.
Case 1: I Just Plugged In a Network Cable and the Whole Network Went Down!
Environment
The network consisted of two access-layer switches uplinked to an aggregation layer running MSTP+HSRP. During a rack-mount of a new server, an additional access switch was being added to expand the network.
Fault Phenomenon
After plugging the new cable into port 23 of the aggregation switch, the monitoring system raised alarms and websites became inaccessible.
Fault Handling
Confirmed that the network engineer had not made any configuration changes.
Verified that no other personnel had performed plug‑in actions.
Monitoring alerts showed loss of host connectivity, pointing to a network-level issue rather than an application fault.
Realized the cause was the newly inserted cable; unplugging it restored service.
Root Cause Analysis
PortFast, a Cisco Catalyst feature that lets a port skip the normal STP listening and learning states and begin forwarding immediately, had been enabled on port 23 of the aggregation switch. PortFast should only be used on access-layer ports connected to end devices, never on switch-to-switch links: a port that skips those states can forward traffic before spanning tree has a chance to block it, allowing a bridging loop to form.
In this case the port was previously used for a server and later repurposed for an access switch, but the Portfast setting remained enabled, leading to the outage.
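The misconfiguration can be sketched in Cisco IOS terms. The interface name and description below are assumptions for illustration; pairing PortFast with BPDU Guard is a common safeguard, since BPDU Guard error-disables the port the moment a switch is plugged into it:

```
! Edge port: PortFast is appropriate here, and BPDU Guard
! shuts the port down if a switch ever appears on it.
interface GigabitEthernet0/23
 description server access port
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable

! Before repurposing the port as a switch uplink, the edge
! settings must be removed:
interface GigabitEthernet0/23
 no spanning-tree portfast
 no spanning-tree bpduguard enable
```

Had BPDU Guard been enabled alongside PortFast, the new access switch's BPDUs would have error-disabled port 23 instead of looping the whole network.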
Case 2: NFS Failure Caused Complete Service Outage
Environment
The backend API stack consisted of Nginx front‑ends and Python services on port 9090, with load balancing via Nginx+Keepalived. A shared NFS was used to store generated QR codes.
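The front-end layout described above might look roughly like this in Nginx terms; the backend IP addresses and timeout values are assumptions, and the Keepalived VIP configuration is not shown:

```nginx
# Nginx on :8080 proxies to the Python API processes on :9090.
upstream api_backend {
    server 10.0.0.11:9090;   # api-node1 (addresses are hypothetical)
    server 10.0.0.12:9090;   # api-node2
}

server {
    listen 8080;
    location / {
        proxy_pass http://api_backend;
        proxy_connect_timeout 3s;
        proxy_read_timeout 10s;  # bound waits so one hung backend cannot pile up requests
    }
}
```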
Fault Phenomenon
Monitoring reported "API HTTP code not 200". All API processes were running, but the services were unreachable.
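The alert condition amounts to a simple HTTP probe. A minimal sketch, assuming a hypothetical health endpoint (a real monitor would run the equivalent check on a schedule):

```shell
#!/bin/sh
# Probe a URL and alert if it does not return HTTP 200.
# The URL below is an assumption, not from the incident.
URL="${1:-http://127.0.0.1:8080/health}"

# curl prints the status code; "000" means no HTTP response at all.
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$URL")

if [ "$code" = "200" ]; then
  echo "OK: $URL returned 200"
else
  echo "ALERT: $URL returned $code"
fi
```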
Fault Handling
Checked API error logs – no anomalies.
Tested the API ports (9090) directly – functional; the Nginx proxy port 8080 failed.
Observed that only the node running NFS (api‑node1) behaved differently.
Found that Nginx error logs showed many failed URL requests.
Identified that api‑node1 was the only node with NFS mounted.
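The NFS checks in the steps above can be sketched as a short triage script. The mount path /data/qrcode is hypothetical; the key point is probing the mount with a timeout, because a dead NFS server can hang any process that touches the mount:

```shell
#!/bin/sh
# Triage: does this node have an NFS mount, and does it respond?
check_nfs() {
  # List any NFS mounts on this node (empty means none).
  nfs_mounts=$(mount -t nfs,nfs4)
  if [ -z "$nfs_mounts" ]; then
    echo "no-nfs"
    return 1
  fi
  # Probe the mount point with a timeout so a hung NFS
  # server does not hang this shell as well.
  if timeout 5 ls /data/qrcode >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "hung"
  fi
}

result=$(check_nfs)
echo "NFS status: $result"
```

Run on each API node, this immediately distinguishes api-node1 (the node with the mount) from the others.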
Root Cause Analysis
The NFS on api‑node1 had crashed, preventing QR‑code generation. Repeated client retries flooded Nginx, causing health‑check failures and marking the backend as down.
Restarting NFS and the API nodes restored normal operation.
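A commonly used hardening step for this failure mode (not part of the original fix) is to mount NFS with soft, bounded timeouts, so that a dead server surfaces I/O errors instead of hanging every process that touches the mount. A hypothetical /etc/fstab entry, with server and paths assumed:

```
# soft: fail with an error after retries instead of blocking forever
# timeo=30 (3s), retrans=2: bound how long a dead server can stall clients
nfs-server:/export/qrcode  /data/qrcode  nfs  soft,timeo=30,retrans=2  0  0
```

The trade-off is that soft mounts can return errors to applications during transient server hiccups, so they suit read-mostly data like generated QR codes better than critical writes.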
Reflections: What to Do Before a Fault?
Operational Discipline
Follow documented rack‑mount and network‑change procedures; unauthorized personnel should not intervene.
Comprehensive Monitoring
Ensure alerts are reviewed thoroughly and prioritized correctly, avoiding missed warnings.
Fault‑Handling Process
Maintain clear, documented steps for incident escalation and resolution.
Reflections: What to Do After a Fault?
Rapid Recovery
ITIL’s incident management emphasizes restoring service as quickly as possible, even if that means using rollback scripts.
Post‑mortem Review
Conduct a joint review with development, testing, and operations to identify prevention measures.
Problem Management
Convert incidents into problem records to eliminate repeat occurrences and mitigate impact.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.