Why Passing Tests Aren’t Proof of Correctness: Dijkstra’s Insight & Modern Strategies
The article explains that a green test run only shows the absence of detected bugs under specific inputs, environments, and assumptions, explores the asymmetry between verification and falsification, discusses the test‑oracle problem, property‑based testing, formal verification, and proposes a risk‑calibrated testing approach.
The Asymmetry of Testing
When a test suite passes, developers often conclude that the code is correct, but that conclusion holds only relative to the current test set, execution environment, and oracle assumptions. A green run indicates that none of the executed tests exposed a defect, not that the code is free of errors.
Epistemological Perspective
Understanding testing’s epistemology—what it can reveal and what it assumes—is a crucial skill for software engineers. Dijkstra’s famous remark that testing can show the presence of bugs but never prove their absence highlights this asymmetry.
Verification vs. Falsification
Following Popper’s philosophy of science, testing is a structured search for counter‑examples rather than a certification of truth. A test suite explores a finite subset of an enormous input and state space, while the deployed software may encounter a practically unbounded range of inputs, states, and environments.
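A small sketch can make the falsification view concrete. The function and example cases below are hypothetical, invented only for illustration: a deliberately incomplete leap-year check passes a plausible example suite, while a bounded search against a reference oracle (the standard library's calendar.isleap) turns up a counter-example.

```python
import calendar


def is_leap_buggy(year: int) -> bool:
    # Deliberately incomplete (hypothetical example): ignores the
    # century rules of the Gregorian calendar.
    return year % 4 == 0


def example_tests_pass() -> bool:
    # A small, plausible-looking example suite; every case passes.
    cases = {2024: True, 2023: False, 2000: True, 1996: True}
    return all(is_leap_buggy(y) == expected for y, expected in cases.items())


def find_counterexample(start: int = 1583, stop: int = 3000):
    # Search a bounded range for the first year where the function
    # disagrees with the reference oracle, or return None if there is none.
    for year in range(start, stop + 1):
        if is_leap_buggy(year) != calendar.isleap(year):
            return year
    return None


if __name__ == "__main__":
    print("example suite green:", example_tests_pass())
    print("first counterexample:", find_counterexample())
```

The green suite and the counter-example coexist: the suite simply never sampled a century year that is not divisible by 400. Note also that a search reporting no counter-example only covers the range it was given.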
Illustrative Example
Consider a function adding two 32‑bit integers. Its input space is 2⁶⁴ (~1.84×10¹⁹) combinations. Exhaustively testing it at 1 000 executions per second would take roughly 585 million years; even at a billion executions per second it would take about 585 years. That covers input values only: in concurrent code, thread interleavings and scheduling enlarge the space further.
From Proof to Confidence
The real question shifts from “Is my code correct?” to “What knowledge have we gained and how confident can we be?” Confidence derives from evidence strength, the number of assumptions, and the cost of potential failures.
Test Oracle Problem
Writing tests requires a reliable oracle: a way of knowing the expected outcome. For simple functions this is easy, but for complex systems (distributed consensus, financial calculations, machine‑learning models) defining the correct oracle can be as hard as writing the code itself, and a flawed oracle makes test results misleading.
Property‑Based Testing
Property‑based testing (e.g., QuickCheck, Hypothesis, jqwik, fast‑check) replaces concrete expected outputs with invariants that must hold for all inputs. For example, instead of asserting reverse([1,2,3]) == [3,2,1], we assert the law reverse(reverse(xs)) == xs for any list xs. This shifts the focus from specific cases to structural properties, though it remains empirical and limited by sampling.
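The reverse law can be sketched without any third-party dependency. The block below is a hand-rolled stand-in for what libraries such as Hypothesis automate (generation, shrinking, reporting); every helper name in it is an assumption made for illustration, not part of any library's API.

```python
import random


def reverse(xs):
    # Implementation under test.
    return xs[::-1]


def random_int_list(rng, max_len=20):
    # Generate a list of random length, including the empty list.
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, max_len))]


def involution_law(f):
    # Build the property "applying f twice returns the original list".
    return lambda xs: f(f(xs)) == xs


def check_property(prop, trials=500, seed=0):
    # Return the first sampled counter-example, or None if every sample passed.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = random_int_list(rng)
        if not prop(xs):
            return xs
    return None


if __name__ == "__main__":
    print(check_property(involution_law(reverse)))           # no counter-example found
    print(check_property(involution_law(lambda xs: xs)))     # identity satisfies the law too
    print(check_property(involution_law(lambda xs: xs[:-1][::-1])))  # broken reverse is caught
```

The second call shows the limit of a single property: the identity function also satisfies the law, so passing it is necessary but not sufficient evidence that reverse is right. A None result means only that none of the sampled lists falsified the property.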
Formal Verification
Formal verification attempts to bridge the gap by mathematically proving that code satisfies a formal specification for all inputs and execution paths. Projects like seL4, CompCert, and TLA⁺‑based designs demonstrate the feasibility, but the cost of creating accurate specifications and trusting the verification tools is high.
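For contrast with sampling, here is a minimal sketch of the same list-reversal law stated as a theorem, assuming Lean 4 and its core library lemma List.reverse_reverse. The theorem name is chosen here for illustration.

```lean
-- Holds for every list of every element type, not just for sampled inputs.
theorem reverse_involution {α : Type} (xs : List α) :
    xs.reverse.reverse = xs := by
  exact List.reverse_reverse xs
```

The proof covers all inputs, but only for the property as stated and for Lean's list model; whether that statement captures what the real system needs remains a specification question.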
Determinism Ladder
Different techniques provide varying levels of certainty at increasing cost: unit/integration tests (low certainty, low cost), property‑based testing (moderate certainty, moderate cost), fuzzing (higher coverage, higher cost), model checking (high certainty for bounded models), abstract interpretation, and theorem proving (very high certainty, very high cost).
Practical, Risk‑Calibrated Approach
Identify high‑risk components (security, financial, data consistency) and write executable invariants.
Use unit tests and property‑based tests to continuously validate those invariants.
Apply aggressive techniques such as fuzzing, replay, and fault injection for the remaining hard‑to‑enumerate space.
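The three steps above can be sketched on a toy example. The Ledger class, account names, and amounts below are hypothetical; the invariants (money is conserved, no balance goes negative) are written as executable checks and then exercised with randomly generated transfers, including invalid ones.

```python
import random


class Ledger:
    # Hypothetical high-risk component: balances are integers (e.g. cents).
    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, src, dst, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.balances[src] < amount:
            raise ValueError("insufficient funds")
        self.balances[src] -= amount
        self.balances[dst] += amount


def check_invariants(ledger, expected_total):
    # Executable invariants: total is conserved and no balance is negative.
    assert sum(ledger.balances.values()) == expected_total, "money created or destroyed"
    assert all(b >= 0 for b in ledger.balances.values()), "negative balance"


def fuzz_transfers(steps=2000, seed=0):
    # Drive the ledger with random transfers and check the invariants
    # after every step, whether the transfer was accepted or rejected.
    rng = random.Random(seed)
    ledger = Ledger({"a": 100, "b": 50, "c": 0})
    total = 150
    accepted = 0
    for _ in range(steps):
        src, dst = rng.choice("abc"), rng.choice("abc")
        amount = rng.randint(-10, 200)
        try:
            ledger.transfer(src, dst, amount)
            accepted += 1
        except ValueError:
            pass  # rejected input; the state must still satisfy the invariants
        check_invariants(ledger, total)
    return accepted


if __name__ == "__main__":
    print("accepted transfers:", fuzz_transfers())
```

Checking after rejected operations as well as accepted ones is a design choice: a failed transfer that leaves partial state behind is exactly the kind of defect example-based tests tend to miss.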
Social Dimension of Testing
Beyond technical guarantees, test suites serve as a shared model of system expectations, helping teams communicate assumptions and maintain collective knowledge. Mixing regression tests with exploratory tests without clear intent leads to brittle suites.
Conclusion
Dijkstra’s warning is not pessimism but an invitation to clarify what we know, what we don’t, and to align testing, property‑based methods, and formal verification with the risk profile of each system component.
