Why Passing Tests Aren’t Proof of Correctness: Dijkstra’s Insight & Modern Strategies

The article explains that a green test run only shows the absence of detected bugs under specific inputs, environments, and assumptions, explores the asymmetry between verification and falsification, discusses the test‑oracle problem, property‑based testing, formal verification, and proposes a risk‑calibrated testing approach.

FunTester

Apr 9, 2026

The Asymmetry of Testing

When a test suite passes, developers often conclude that the code is correct. That conclusion holds only relative to the current test set, execution environment, and oracle assumptions. A green run indicates that no defect was detected under those conditions, not that the code is free of defects.
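A minimal sketch of a green run that hides a defect. The function `is_leap_year` and its four test cases are hypothetical, chosen only for illustration; `calendar.isleap` from the standard library serves as the reference.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Deliberately incomplete: ignores the century rules of the Gregorian calendar."""
    return year % 4 == 0


def run_suite() -> bool:
    """A small suite that passes even though the function is wrong."""
    cases = {2024: True, 2023: False, 2000: True, 2019: False}
    return all(is_leap_year(year) == expected for year, expected in cases.items())


suite_is_green = run_suite()          # no chosen input triggers the defect
missed_defect = is_leap_year(1900)    # the function says "leap year"
reference = calendar.isleap(1900)     # the standard library disagrees
```

The suite is green because none of its inputs is a century year that is not divisible by 400. The defect was always there; the test set simply never asked about it.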

Epistemological Perspective

Understanding testing’s epistemology—what it can reveal and what it assumes—is a crucial skill for software engineers. Dijkstra’s famous remark that testing can show the presence of bugs but never prove their absence highlights this asymmetry.

Verification vs. Falsification

Following Popper’s philosophy of falsification, testing is a structured search for counterexamples rather than a certification of truth. A test suite samples a finite subset of an enormous input and state space, while the deployed software may encounter a practically unbounded range of inputs, states, and environments.

Illustrative Example

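Consider a function adding two 32‑bit integers. Its input space has 2⁶⁴ (~1.84×10¹⁹) combinations. Exhaustively testing them at 1,000 executions per second would take about 1.84×10¹⁶ seconds, roughly 585 million years; even at a billion executions per second it would take about 585 years. And this counts only input values. For concurrent code, thread interleavings and scheduling multiply the space further.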
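A quick back-of-the-envelope calculation gives the scale of exhaustive testing for this function. It assumes a Julian year of 365.25 days and counts only input values, ignoring setup time and any concurrency effects.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year, about 3.156e7 seconds

INPUT_PAIRS = 2 ** 64  # two independent 32-bit operands


def years_to_exhaust(executions_per_second: float) -> float:
    """Years needed to run every input pair once at the given rate."""
    return INPUT_PAIRS / executions_per_second / SECONDS_PER_YEAR


years_at_1k = years_to_exhaust(1_000)          # a slow test harness
years_at_1g = years_to_exhaust(1_000_000_000)  # an optimistic tight loop

print(f"{years_at_1k:.3e} years at 1e3 executions/s")
print(f"{years_at_1g:.1f} years at 1e9 executions/s")
```

Under these assumptions the result is on the order of hundreds of millions of years at 1,000 executions per second, and still several centuries at a billion per second.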

From Proof to Confidence

The real question shifts from “Is my code correct?” to “What knowledge have we gained and how confident can we be?” How much confidence is warranted depends on the strength of the evidence and the number of assumptions behind it; how much is needed depends on the cost of potential failures.

Test Oracle Problem

Writing tests requires a reliable oracle: a way of knowing the expected outcome. For simple functions this is easy, but for complex systems (distributed consensus, financial calculations, machine‑learning models) defining the correct oracle can be as hard as writing the code itself. When the oracle is wrong, or merely repeats the implementation’s own logic, a green test confirms the wrong behavior.
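A small sketch of how an oracle can mislead. The bill splitter `split_cents` and both checks are hypothetical; the point is only the contrast between an oracle copied from the implementation and one derived from the requirement.

```python
def split_cents(total_cents: int, parts: int) -> list[int]:
    """Hypothetical bill splitter with a defect: the remainder is silently dropped."""
    share = total_cents // parts
    return [share] * parts


def mirrored_oracle_passes(total_cents: int, parts: int) -> bool:
    """Oracle derived from the same formula as the code, so it cannot disagree."""
    expected = [total_cents // parts] * parts
    return split_cents(total_cents, parts) == expected


def independent_oracle_passes(total_cents: int, parts: int) -> bool:
    """Oracle derived from the requirement: no cent may appear or vanish."""
    return sum(split_cents(total_cents, parts)) == total_cents


mirrored = mirrored_oracle_passes(1000, 3)        # green, but proves nothing
independent = independent_oracle_passes(1000, 3)  # red: one cent went missing
```

The mirrored check passes for every input because it encodes the same mistake as the code. The independent check needs no knowledge of the algorithm, only of what the result must satisfy.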

Property‑Based Testing

Property‑based testing (e.g., QuickCheck, Hypothesis, jqwik, fast‑check) replaces concrete expected outputs with invariants that must hold for all inputs. For example, instead of asserting reverse([1,2,3]) == [3,2,1], we assert the law reverse(reverse(xs)) == xs for any list xs. This shifts the focus from specific cases to structural properties, though it remains empirical and limited by sampling.
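The idea can be sketched with nothing but the standard library. The generator and the `find_counterexample` helper below are hypothetical stand-ins written for this illustration; real libraries such as Hypothesis add smarter generation and shrinking of failing inputs, which this sketch does not attempt.

```python
import random


def reverse(xs: list[int]) -> list[int]:
    """Function under test: returns a new list with the elements in reverse order."""
    return xs[::-1]


def random_int_list(rng: random.Random, max_len: int = 20) -> list[int]:
    """Generate a list of random length (possibly empty) with random integers."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, max_len))]


def find_counterexample(prop, trials: int = 500, seed: int = 0):
    """Return the first generated input that falsifies prop, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = random_int_list(rng)
        if not prop(xs):
            return xs
    return None


# The law from the text: reversing twice yields the original list.
involution_result = find_counterexample(lambda xs: reverse(reverse(xs)) == xs)

# A false "law" for contrast: reversing never changes a list.
identity_result = find_counterexample(lambda xs: reverse(xs) == xs)
```

A `None` result for the first property means only that 500 sampled lists failed to break it. The second property is falsified quickly, which is the asymmetry again: one counterexample settles the matter, while any number of passing samples does not.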

Formal Verification

Formal verification attempts to bridge the gap by mathematically proving that code satisfies a formal specification for all inputs and execution paths. Projects like seL4, CompCert, and TLA⁺‑based designs demonstrate the feasibility, but the cost of creating accurate specifications and trusting the verification tools is high.
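As a toy illustration of the difference in kind, the reverse law from the property‑based testing discussion can be proved for every list instead of being sampled. The sketch assumes Lean 4 and its core library lemma `List.reverse_reverse`; the theorem name `reverse_twice` is made up here, and the example is nowhere near the scale of seL4 or CompCert.

```lean
-- Holds for all lists of any element type, not for a sample of them.
theorem reverse_twice {α : Type} (xs : List α) : xs.reverse.reverse = xs :=
  List.reverse_reverse xs
```

Even here the guarantee is relative: it concerns Lean's `List.reverse` as specified, and it rests on trusting the proof checker.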

Certainty Ladder

Different techniques provide different levels of assurance at increasing cost: unit and integration tests (low certainty, low cost), property‑based testing (moderate certainty, moderate cost), fuzzing (broader input coverage, higher cost), model checking (high certainty, but only within the bounds of the model), abstract interpretation (sound guarantees for specific classes of properties), and theorem proving (very high certainty, very high cost).
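The trade-off behind bounded techniques can be shown at small scale: once the space is cut down far enough, every case can be enumerated. The sketch below is not a model checker, only an exhaustive check of a hypothetical 8‑bit adder against modular arithmetic.

```python
def add_u8(a: int, b: int) -> int:
    """Hypothetical 8-bit wrapping adder built from bitwise operations."""
    while b:
        carry = (a & b) << 1   # bits that overflow into the next position
        a = (a ^ b) & 0xFF     # sum without carries, kept to 8 bits
        b = carry & 0xFF       # carries to add in the next round
    return a


def check_all_inputs() -> int:
    """Compare against modular arithmetic for every pair; return the number checked."""
    checked = 0
    for a in range(256):
        for b in range(256):
            assert add_u8(a, b) == (a + b) % 256, (a, b)
            checked += 1
    return checked


pairs_checked = check_all_inputs()  # 256 * 256 input pairs
```

Within the 8‑bit bound the result is a genuine guarantee for this function. It says nothing about a 32‑bit variant, which is why the certainty of bounded methods is always stated relative to the bound.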

Practical, Risk‑Calibrated Approach

Identify high‑risk components (security, financial, data consistency) and write executable invariants.

Use unit tests and property‑based tests to continuously validate those invariants.

Apply aggressive techniques such as fuzzing, replay, and fault injection for the remaining hard‑to‑enumerate space.
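The three steps above can be sketched together on a toy ledger. Everything here is hypothetical: the `transfer` function, the `InjectedFault` exception, the account names, and the 20% fault rate are assumptions made for illustration, not a recommended design.

```python
import random


class InjectedFault(Exception):
    """Simulated crash used for fault injection."""


def transfer(balances: dict[str, int], src: str, dst: str, amount: int,
             fail_midway: bool = False) -> None:
    """Hypothetical transfer that stages both changes and commits them together."""
    if amount <= 0 or balances[src] < amount:
        raise ValueError("invalid transfer")
    staged = dict(balances)
    staged[src] -= amount
    if fail_midway:
        raise InjectedFault("crash between debit and credit")
    staged[dst] += amount
    balances.update(staged)  # commit only after both sides are staged


def invariant_holds(balances: dict[str, int], expected_total: int) -> bool:
    """Executable invariant: money is conserved and no balance is negative."""
    return sum(balances.values()) == expected_total and min(balances.values()) >= 0


def stress(trials: int = 1000, seed: int = 1) -> bool:
    """Random transfers with injected faults; check the invariant after each one."""
    rng = random.Random(seed)
    balances = {"a": 500, "b": 300, "c": 200}
    total = sum(balances.values())
    for _ in range(trials):
        src, dst = rng.sample(sorted(balances), 2)
        try:
            transfer(balances, src, dst, rng.randint(-50, 600),
                     fail_midway=rng.random() < 0.2)
        except (ValueError, InjectedFault):
            pass  # rejected or crashed transfers must leave the invariant intact
        if not invariant_holds(balances, total):
            return False
    return True


invariant_survived = stress()
```

The invariant is written once and reused: a unit test can call `invariant_holds` after a single transfer, while the randomized loop with injected faults probes the part of the space that is hard to enumerate by hand.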

Social Dimension of Testing

Beyond technical guarantees, test suites serve as a shared model of system expectations, helping teams communicate assumptions and maintain collective knowledge. Mixing regression tests with exploratory tests without clear intent leads to brittle suites.

Conclusion

Dijkstra’s warning is not pessimism but an invitation to be clear about what we know and what we don’t, and to match conventional testing, property‑based methods, and formal verification to the risk profile of each system component.