Who Was the World’s First SRE? Uncovering Margaret Hamilton’s Legacy
This article explores the origins of Site Reliability Engineering, highlights Margaret Hamilton as the likely first SRE through her work on NASA’s Apollo program, and draws lessons on reliability, disaster prevention, and the evolution of modern SRE practices.
Preface
Who is the world’s first SRE?
When was the first SRE officially recognized? (You might be surprised)
Was the first SRE a woman?
1. Understanding SRE
A common belief holds that once a system has been developed and deployed to production, it is "stable" and needs few engineers to optimize or maintain it. From today’s perspective, that belief is clearly wrong.
Google’s view is that if software engineering focuses on designing and building systems, another profession should focus on the entire lifecycle management of those systems.
Google calls this role a Site Reliability Engineer (SRE), a profession that requires a broad skill set distinct from other roles.
What SRE Really Is
First, SREs are engineers. They use computer science and software engineering techniques to design and develop large, distributed software systems. They often collaborate with product development teams and sometimes build additional components such as backup and load‑balancing systems, aiming for reuse across projects.
Second, SREs focus on reliability. Ben Treynor Sloss, Google’s VP of Operations, who coined the SRE title, states that reliability is the most fundamental concept in any product design: *If a system cannot be used reliably, it has no purpose.* SREs continuously improve architecture and operational processes to make systems more reliable, scalable, and resource‑efficient.
Finally, SREs operate services on distributed clusters, ranging from global storage services to email and the original Google web search. Over time, SREs have taken over the operation of most of Google’s internal products, including Google Cloud Platform and infrastructure systems like Bigtable.
2. The First SRE: Margaret Hamilton
Margaret Hamilton, the legendary programmer who worked at MIT’s Instrumentation Laboratory, is identified in the Chinese translation of "Google SRE" as the world’s first SRE. She contributed to the software development of NASA’s Apollo program.
Was the first SRE born from the Apollo program?
During the development of Apollo 7 (circa 1968), Margaret brought her young daughter Lauren to the lab. While Margaret was running a flight‑simulation test on a mainframe, Lauren accidentally pressed keys on the DSKY (the spacecraft's display‑and‑keyboard unit), causing the simulation to crash.
The unexpected crash aborted the simulated mission.
Investigation revealed that Lauren had triggered the P01 sub‑program, which deletes navigation data and would leave the onboard computer unable to continue the flight.
P01 is a pre‑launch routine; if executed during flight, it would cause a catastrophic loss of navigation data.
With what we would now call SRE intuition, Margaret proposed a software change: a special state check that would prevent accidental execution of P01 during flight. NASA management rejected the change, judging such a mistake too unlikely to occur.
Consequently, Margaret could only add a warning in the flight manual: "Do not trigger P01 during flight."
When P01 was later triggered by accident during the Apollo 8 flight, the warning in the manual helped engineers quickly restore the navigation data and continue the mission, preventing a potential disaster.
"No matter how well you understand a software system, you cannot prevent human error entirely," Margaret once said.
This embodies the core SRE principle: true reliability requires anticipating and mitigating human mistakes.
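The kind of state check Margaret proposed can be illustrated with a small modern sketch. This is not the Apollo guidance code: the names FlightPhase, ALLOWED_PHASES, ProgramRejected, and select_program are hypothetical, and the sketch only assumes that the software knows the current flight phase and can refuse a program that is not permitted in that phase.

```python
from enum import Enum, auto


class FlightPhase(Enum):
    PRELAUNCH = auto()
    IN_FLIGHT = auto()


class ProgramRejected(Exception):
    """Raised when a program is selected in a phase where it is not allowed."""


# Hypothetical table: the phases in which each restricted program may run.
# Programs not listed here are treated as unrestricted.
ALLOWED_PHASES = {
    "P01": {FlightPhase.PRELAUNCH},
}


def select_program(program: str, phase: FlightPhase) -> str:
    """Start a program only if the current phase permits it."""
    allowed = ALLOWED_PHASES.get(program)
    if allowed is not None and phase not in allowed:
        # Refuse instead of relying on the operator to remember the manual.
        raise ProgramRejected(f"{program} is not allowed during {phase.name}")
    return f"{program} started"


if __name__ == "__main__":
    print(select_program("P01", FlightPhase.PRELAUNCH))
    try:
        select_program("P01", FlightPhase.IN_FLIGHT)
    except ProgramRejected as err:
        print(f"rejected: {err}")
```

The design point is that the guard lives in the software path itself, so a mistaken key press is refused rather than depending on a warning in a manual.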
Margaret’s Trailblazing Path
In an era that discouraged women from high‑intensity technical work, Margaret pursued a non‑traditional programmer career, balancing scientific passion with family life.
She mastered multiple assembly languages required for the various computer modules on the Apollo spacecraft, contributed to the first Kalman filter implementation, and led software development at MIT’s Instrumentation Laboratory.
Her work aimed to prevent crashes, directly supporting the success of Apollo 11’s lunar landing. In 2003, she received the NASA Exceptional Space Act Award, and she is celebrated as one of the greatest programmers in history.
Engineers today continue to encounter similar reliability challenges, underscoring the timeless relevance of SRE principles.
Only by obsessively attending to details, preparing thorough disaster‑recovery plans, and staying vigilant can we truly avoid catastrophic failures.
This is the most important SRE philosophy!
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.