How One Line of C Code Crippled AT&T’s Network for 9 Hours
A 1990 AT&T network outage caused by an untested C code change led to a nine‑hour service collapse, massive financial loss, and widespread disruption, illustrating how a single software bug can trigger cascading failures in large‑scale telecommunications systems.
Incident Overview
On January 15, 1990, AT&T’s New Jersey operations center detected a massive system failure, with red warnings flashing across network displays. The outage persisted for nine hours, causing a 50% call‑connection failure rate, over $60 million in losses, more than 60,000 phones rendered unusable, and delays affecting 500 flights and 85,000 passengers.
Background
AT&T’s long‑distance network was considered a model of efficiency, employing advanced electronic switches and signaling systems that typically routed calls within seconds.
Root Cause
The failure originated from a switch in New York. A recent software update introduced a coding error that affected 114 switches across the network. When the New York switch reset and sent a signal, the error triggered a domino effect, leading to a widespread network collapse.
Interestingly, the faulty software had bypassed testing because the code change was deemed minor and was approved by management without verification.
Problem Details
The error was in a C program: a misplaced break statement inside an if-else that was itself nested in a switch case, which led to data overwrites and switch resets.
Faulty Pseudocode
while (ring receive buffer not empty and side buffer not empty):
    Initialize pointer to first message in side buffer or ring receive buffer
    get copy of buffer
    switch (message):
        case (incoming_message):
            if (sending switch is out of service):
                if (ring write buffer is empty):
                    send "in service" to status map
                else:
                    break // The error was here!
            END IF
            process incoming message, set up pointers to optional parameters
            break
    END SWITCH
    do optional parameter work
Analysis
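If the sending switch is out of service and the ring write buffer is not empty, the inner check (ring write buffer is empty) fails and execution reaches the break in the else branch.
In C, a break inside an if does not leave the if; it exits the nearest enclosing switch or loop. Here it exits the switch, so the intended processing of the incoming message and the setup of pointers to optional parameters are skipped.
After the break, execution continues with the optional parameter work, and data that should have been retained (the pointer) is overwritten.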
Error‑detection software flags the data overwrite and initiates a shutdown, resetting the switch. Because every switch ran the same flawed software, a chain reaction of resets crippled the entire network.
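The control flow described above can be reproduced in a few lines of C. The sketch below is illustrative only: the Trace struct, the INCOMING_MESSAGE constant, and both function names are hypothetical and are not taken from AT&T's code. The "fixed" variant assumes the intent was to skip only the status update when the ring write buffer is busy; the source does not say how the real fix was written.

```c
#include <stdbool.h>

/* Hypothetical record of which steps ran for one message. */
typedef struct {
    bool status_map_updated;  /* "in service" was sent to the status map */
    bool message_processed;   /* pointers to optional parameters were set up */
    bool optional_work_done;  /* optional parameter work ran */
} Trace;

enum { INCOMING_MESSAGE = 1 };  /* hypothetical message type */

/* Mirrors the faulty pseudocode for a single message. */
static Trace handle_message_buggy(int message, bool sender_out_of_service,
                                  bool ring_write_buffer_empty)
{
    Trace t = { false, false, false };

    switch (message) {
    case INCOMING_MESSAGE:
        if (sender_out_of_service) {
            if (ring_write_buffer_empty) {
                t.status_map_updated = true;
            } else {
                break;  /* exits the switch, not just the if */
            }
        }
        t.message_processed = true;
        break;
    }

    /* Runs after the switch even when the message was never processed. */
    t.optional_work_done = true;
    return t;
}

/* Assumed fix: skip only the status update, always process the message. */
static Trace handle_message_fixed(int message, bool sender_out_of_service,
                                  bool ring_write_buffer_empty)
{
    Trace t = { false, false, false };

    switch (message) {
    case INCOMING_MESSAGE:
        if (sender_out_of_service && ring_write_buffer_empty) {
            t.status_map_updated = true;
        }
        t.message_processed = true;
        break;
    }

    t.optional_work_done = true;
    return t;
}
```

Calling handle_message_buggy(INCOMING_MESSAGE, true, false) returns a trace in which message_processed is false while optional_work_done is true, which is the state the analysis describes. The same call to handle_message_fixed sets message_processed to true.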
Repair
Engineers restored AT&T’s system to normal operation after nine hours, primarily by rolling the switches back to a previous, stable code version. Fully understanding and fixing the bug, however, took software engineers two weeks of intensive code review, testing, and replication.
Conclusion
This incident was not AT&T’s largest system collapse of the 1990s, but it underscored how even well‑designed networks can be vulnerable to human error and process gaps. Modern companies have improved processes, yet similar failures continue to arise when testing is skipped and code changes are not rigorously validated.
Original sources: engineercodex.substack.com and jdon.com
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
