
When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents: a network outage caused by PortFast left enabled on an inter‑switch uplink, and an NFS failure that crippled an API service. It then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem reviews.


Case 1: I Just Plugged In a Network Cable and the Whole Network Went Down!

Environment

While racking a new server, the team was adding an access switch to expand the network. The existing design used two access‑layer switches uplinked to an aggregation layer running MSTP+HSRP.

Network diagram

Fault Phenomenon

After plugging the new cable into port 23 of the aggregation switch, the monitoring system raised alarms and websites became inaccessible.

Fault Handling

Confirmed that the network engineer had not made any configuration changes.

Verified that no other personnel had performed plug‑in actions.

Monitoring alerts showed host connectivity loss, pointing to a network issue rather than an application one.

Realized the cause was the newly inserted cable; unplugging it restored service.

Root Cause Analysis

PortFast, a Cisco Catalyst feature, skips the normal STP listening and learning states and moves a port straight into forwarding. It should only be enabled on access‑layer ports connected to end devices. On a port linking two switches, the port starts forwarding before spanning tree can evaluate the topology, which can create a temporary bridging loop and broadcast storm.

In this case the port was previously used for a server and later repurposed for an access switch, but the Portfast setting remained enabled, leading to the outage.
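The fix is procedural as much as technical: PortFast belongs on host‑facing access ports, ideally paired with BPDU Guard, and must be removed when a port is repurposed as a switch uplink. A minimal Cisco IOS sketch (interface numbers are illustrative assumptions, not from the incident):

```
! Access port facing an end host: PortFast is appropriate, and
! BPDU Guard err-disables the port if a switch is ever plugged in.
interface GigabitEthernet0/1
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable

! Port repurposed as an uplink to another switch: remove PortFast
! so the port walks through the normal STP states.
interface GigabitEthernet0/23
 no spanning-tree portfast
```

Had BPDU Guard been enabled alongside PortFast on port 23, the port would have been shut down the moment the new switch sent a BPDU, containing the fault to one port instead of the whole network.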

Case 2: NFS Failure Caused Complete Service Outage

Environment

The backend API stack consisted of Nginx front ends proxying to Python services on port 9090, load‑balanced with Nginx+Keepalived. A shared NFS export stored the generated QR codes.
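A minimal sketch of the load‑balancing layer described above. Only ports 8080 and 9090 come from the incident; the upstream addresses and timeout values are illustrative assumptions:

```nginx
# Nginx on 8080 proxying to the Python API backends on 9090.
upstream api_backend {
    server 192.168.1.11:9090 max_fails=3 fail_timeout=10s;  # api-node1 (address assumed)
    server 192.168.1.12:9090 max_fails=3 fail_timeout=10s;  # api-node2 (address assumed)
}

server {
    listen 8080;
    location / {
        proxy_pass http://api_backend;
        proxy_connect_timeout 2s;  # fail fast rather than queue behind a hung backend
        proxy_read_timeout 5s;
    }
}
```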

Architecture diagram

Fault Phenomenon

Monitoring reported “API http code not 200”. All API processes were running, but the services were unreachable.

Fault Handling

Checked API error logs – no anomalies.

Tested API ports – functional; Nginx proxy port 8080 failed.

Observed that only the node running NFS (api‑node1) behaved differently.

Found that Nginx error logs showed many failed URL requests.

Identified that api‑node1 was the only node with NFS mounted.
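The key distinction in the steps above is that a process can be alive while its port is unreachable. A minimal TCP probe, sketched here in Python, checks the port itself rather than the process table (host and port values are whatever your deployment uses):

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A running process with a hung or blocked listener still fails this
    probe, which is exactly the Case 2 symptom: processes up, service down.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Probing both the backend port (9090) and the proxy port (8080) per node would have localized the fault to api‑node1 in one pass.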

Root Cause Analysis

The NFS service on api‑node1 had crashed; requests touching the mount blocked on I/O, so QR codes could not be generated. Repeated client retries then flooded Nginx, its health checks failed, and the backend was marked down.

Restarting NFS and the API nodes restored normal operation.
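A hung NFS mount is notoriously hard to spot because any access to it simply blocks. One common monitoring trick, sketched here with an assumed mount path, is to bound a `stat` call with a timeout so the check itself cannot hang:

```python
import subprocess


def nfs_mount_alive(mount_point: str, timeout_s: int = 3) -> bool:
    """Return True if stat on the mount point completes within timeout_s.

    A healthy mount answers stat almost instantly; a hung NFS mount
    blocks indefinitely, so hitting the timeout is treated as failure.
    """
    try:
        subprocess.run(
            ["stat", mount_point],
            timeout=timeout_s,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False
```

Alerting on this check per node would have flagged api‑node1 directly, instead of the team inferring the NFS problem from Nginx health‑check failures.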

Reflections: What to Do Before a Fault?

Operational Discipline

Follow documented rack‑mount and network‑change procedures; unauthorized personnel should not intervene.

Comprehensive Monitoring

Ensure alerts are reviewed thoroughly and prioritized correctly, avoiding missed warnings.

Fault‑Handling Process

Maintain clear, documented steps for incident escalation and resolution.

Reflections: What to Do After a Fault?

Rapid Recovery

ITIL’s incident management emphasizes restoring service as quickly as possible, even if that means using rollback scripts.

Post‑mortem Review

Conduct a joint review with development, testing, and operations to identify prevention measures.

Problem Management

Convert incidents into problem records to eliminate repeat occurrences and mitigate impact.

operations, network troubleshooting, incident management, fault analysis, NFS, ITIL
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
