
Analyzing the Origins of Flaky Tests: Size, Tooling, and Instability at Google

This article examines why some tests become flaky, showing that larger test binaries and higher RAM usage strongly correlate with instability, while the choice of testing tools has a smaller effect, and offers recommendations for reducing flaky tests in large‑scale continuous integration environments.

Continuous Delivery 2.0

Flaky tests are those that sometimes pass and sometimes fail when run against the same code, making failure signals ambiguous. When a previously passing test starts failing, it may indicate a new bug, but flaky tests can obscure this signal.
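The definition above can be turned into a small check. The sketch below is only an illustration: the `(test_name, revision, passed)` record shape and the function name `find_flaky_tests` are assumptions, not part of Google's tooling.

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Return the names of tests that both passed and failed at the same revision.

    `runs` is an iterable of (test_name, revision, passed) tuples; this record
    shape is an assumption made for illustration.
    """
    outcomes = defaultdict(set)
    for test_name, revision, passed in runs:
        outcomes[(test_name, revision)].add(bool(passed))
    # A (test, revision) pair that saw both True and False is flaky.
    return sorted({name for (name, _rev), seen in outcomes.items() if len(seen) == 2})

runs = [
    ("login_test", "rev1", True),
    ("login_test", "rev1", False),   # same code, different result: flaky
    ("parser_test", "rev1", True),
    ("parser_test", "rev2", False),  # result changed along with the code: not flaky
]
print(find_flaky_tests(runs))
```

Only `login_test` is reported, because `parser_test` changed outcome across revisions, which may point to a real regression rather than flakiness.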

Google’s continuous integration system runs about 4.2 million tests, of which roughly 63,000 (about 1.5%, or just under 2%) have at least one flaky run in a given week. Understanding and fixing flaky tests requires analyzing their characteristics.
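The weekly share follows directly from the two counts quoted above. A quick check of the arithmetic, with variable names chosen only for this example:

```python
total_tests = 4_200_000   # tests in CI, as quoted in the article
flaky_tests = 63_000      # tests with at least one flaky run in a week

flaky_share = flaky_tests / total_tests
print(f"{flaky_share:.1%}")
```

The ratio is 0.015, so the share is about 1.5%, a little under 2%.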

Test Size and Flakiness

Tests are categorized subjectively as small, medium, or large. Over one week, 0.5% of small tests, 1.6% of medium tests, and 14% of large tests were flaky, a clear increase in instability with test size.

Objective metrics, namely binary size and RAM usage, show a strong correlation with flakiness. Larger binaries and higher RAM consumption correspond to higher flaky rates, and linear fits to the bucketed data reach r² values of up to 0.94 for the most predictive subsets.
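For readers who want to see what such a fit involves, here is a minimal sketch of an ordinary least-squares line and its r² in plain Python. The data points are hypothetical and are not Google's measurements; the function name `linear_fit_r2` is likewise just for this example.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical RAM bucket midpoints (MB) and flaky rates; not Google's data.
ram_mb = [100, 200, 300, 400, 500]
flaky_rate = [0.010, 0.022, 0.028, 0.041, 0.050]

slope, intercept, r2 = linear_fit_r2(ram_mb, flaky_rate)
print(f"slope={slope:.6f} intercept={intercept:.4f} r2={r2:.3f}")
```

An r² close to 1 means the straight line accounts for most of the variation in flaky rate across buckets.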

When bucket sizes are adjusted, the correlation improves, suggesting that binary size and RAM are better predictors of flakiness than test size alone.
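Bucketing here means grouping tests by a metric such as RAM usage and computing one flaky rate per group, and those per-bucket rates are what a fit is run on. The sketch below assumes fixed-width buckets and a hypothetical `(ram_mb, is_flaky)` record shape; the article does not say which bucketing scheme was actually used.

```python
from collections import defaultdict

def flaky_rate_by_bucket(tests, bucket_width_mb):
    """Group tests into fixed-width RAM buckets and return {bucket_start: flaky_rate}.

    `tests` is a list of (ram_mb, is_flaky) pairs, a hypothetical record shape.
    """
    totals = defaultdict(int)
    flaky = defaultdict(int)
    for ram_mb, is_flaky in tests:
        bucket_start = int(ram_mb // bucket_width_mb) * bucket_width_mb
        totals[bucket_start] += 1
        flaky[bucket_start] += int(is_flaky)
    return {b: flaky[b] / totals[b] for b in sorted(totals)}

# Made-up sample: (RAM in MB, whether the test was flaky that week).
tests = [(50, False), (80, False), (120, False), (180, True),
         (220, False), (260, True), (310, True), (390, True)]

print(flaky_rate_by_bucket(tests, 100))
print(flaky_rate_by_bucket(tests, 200))
```

The same tests produce different per-bucket rates at different widths, which is why the choice of bucket size can change how well a line fits.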

Tool Influence

Tests written with certain tools (e.g., WebDriver) show higher flaky rates, but this is largely because those tools are used for larger tests. After accounting for size, tool impact is modest.

Further analysis comparing RAM usage and binary size across tools confirms that RAM usage explains more variance in flakiness than the tool itself.
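One simple way to separate tool effects from size effects is to compare tools only within the same size bucket. The sketch below uses contrived data with placeholder tool names (`tool_a`, `tool_b`) in which the two tools are equally flaky at each size, yet one looks worse overall because it is used mostly for large tests. This is a stand-in for the idea, not the analysis the original study performed.

```python
from collections import defaultdict

def flaky_rate_by_tool_within_size(tests):
    """Return {size_bucket: {tool: flaky_rate}} so tools are compared at equal size.

    `tests` is a list of (tool, size_bucket, is_flaky) tuples; the tool names
    and buckets used below are made up for illustration.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [flaky, total]
    for tool, size_bucket, is_flaky in tests:
        cell = counts[size_bucket][tool]
        cell[0] += int(is_flaky)
        cell[1] += 1
    return {
        bucket: {tool: flaky / total for tool, (flaky, total) in tools.items()}
        for bucket, tools in counts.items()
    }

def overall_flaky_rate(tests, tool):
    """Flaky rate for one tool, ignoring size."""
    outcomes = [is_flaky for t, _size, is_flaky in tests if t == tool]
    return sum(outcomes) / len(outcomes)

# Contrived data: tool_b is used mostly for large tests.
tests = (
    [("tool_a", "small", False)] * 18 + [("tool_a", "small", True)] * 2
    + [("tool_a", "large", False)] * 3 + [("tool_a", "large", True)] * 2
    + [("tool_b", "small", False)] * 9 + [("tool_b", "small", True)] * 1
    + [("tool_b", "large", False)] * 12 + [("tool_b", "large", True)] * 8
)

print(overall_flaky_rate(tests, "tool_a"), overall_flaky_rate(tests, "tool_b"))
print(flaky_rate_by_tool_within_size(tests))
```

Ignoring size, `tool_b` appears almost twice as flaky as `tool_a`; within each size bucket the rates match, so in this toy data the gap comes entirely from how the tools are used.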

Conclusions

Test size correlates with flakiness, but Google has only three coarse size categories, which limits how useful size is as a predictor. Objective measures such as binary size and RAM usage are stronger indicators of test fragility.

Tests written with specific tools appear more flaky, but this is mostly due to their larger size; tool choice alone contributes little.

Before writing a large test, engineers should ask what the smallest test that covers the behavior would be. When a large test is unavoidable, it needs extra care, because larger tests take more effort to keep stable.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Continuous Delivery 2.0

Tech and case studies on organizational management, team management, and engineering efficiency
