
Automated Failure Testing (Training Smarter Monkeys) – Netflix’s Implementation of the Molly Algorithm

Netflix describes how it extended the academic Molly fault‑injection method into an automated, production‑scale failure‑testing system that explores dependency trees, defines success criteria, runs short low‑impact experiments, and discovers hidden faults before they affect users.


Introduction – Netflix found proactive fault testing essential for uncovering hidden production issues and improving reliability. Manual testing was tedious and limited to single‑service failures, so the team sought an automated approach inspired by the academic Molly fault‑injection method.

Exploration Algorithm – Building on the Molly concept, Netflix implemented a dependency‑tree‑driven, lineage‑driven fault injection (LDFI) technique that enumerates the possible failure points for a given request (labeled A, R, P, B, and so on), selects combinations of them, injects the corresponding faults, and observes one of three outcomes: the request fails, the request succeeds because the fault was irrelevant, or the request succeeds after an automatic fail‑over.
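The three outcomes can be made concrete in a minimal sketch. The function and argument names below are assumptions for illustration, not Netflix's actual API; the inputs stand in for "did the request succeed?" and "did the injected fault actually land on the request's path?":

```python
from enum import Enum

class Outcome(Enum):
    REQUEST_FAILED = "failure surfaced to the user"
    FAULT_IRRELEVANT = "request succeeded; fault never on the path"
    FAILED_OVER = "request succeeded via an alternate path"

def classify(request_succeeded: bool, fault_was_exercised: bool) -> Outcome:
    # A request that fails under injection reveals a real weakness.
    if not request_succeeded:
        return Outcome.REQUEST_FAILED
    # Success without the fault ever being exercised tells us nothing new.
    if not fault_was_exercised:
        return Outcome.FAULT_IRRELEVANT
    # Success despite an exercised fault means a fail-over absorbed it.
    return Outcome.FAILED_OVER
```

Only the third outcome is evidence of working redundancy; the second merely means the experiment has to be counted as inconclusive for that injection point.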

The algorithm does not prescribe a search order; Netflix implemented a heuristic that first tests every single‑point failure, then expands to two‑point combinations, and so on, pruning any combination that contains an already‑identified fault.
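That breadth‑first heuristic can be sketched as follows. This is a simplified model, not Netflix's implementation: `run_experiment` is a hypothetical callback that injects the given combination of faults and returns `True` if the request still succeeds.

```python
from itertools import combinations

def explore(points, run_experiment):
    """Breadth-first search over fault combinations: all single points
    first, then pairs, and so on. Once a combination is known to cause
    a failure, every superset of it is pruned, since it could only
    re-confirm the same fault."""
    found = []
    for size in range(1, len(points) + 1):
        for combo in combinations(points, size):
            # Prune: skip combinations containing an already-failing set.
            if any(set(f) <= set(combo) for f in found):
                continue
            if not run_experiment(combo):
                found.append(combo)
    return found
```

The pruning is what keeps the experiment count tractable: without it, the search space grows exponentially in the number of injection points.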

Automated Failure Testing Implementation

2.1 Dependency‑Tree Construction – Netflix leverages its tracing system and the FIT (Fault Injection Test) service to build a request‑level dependency graph, identifying injection points such as Hystrix commands, cache lookups, DB queries, and HTTP calls.
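A minimal sketch of turning flattened trace spans into such a dependency tree is shown below. The span tuples and injection‑point labels are invented for illustration; they mimic the kinds of points the article names (Hystrix commands, cache lookups, DB queries, HTTP calls) rather than reproducing Netflix's tracing format:

```python
from collections import defaultdict

# Hypothetical flattened trace: (span_id, parent_id, injection_point)
spans = [
    ("1", None, "edge-gateway"),
    ("2", "1", "hystrix:UserCommand"),
    ("3", "1", "http:recommendations"),
    ("4", "2", "cache:user-profile"),
    ("5", "4", "db:subscriber-lookup"),
]

def build_tree(trace):
    """Rebuild the parent/child structure of a single request's trace."""
    children = defaultdict(list)
    labels = {}
    root = None
    for span_id, parent, point in trace:
        labels[span_id] = point
        if parent is None:
            root = span_id
        else:
            children[parent].append(span_id)
    return root, children, labels

def injection_points(trace):
    # Every node in the tree is a candidate place to inject a fault.
    return [point for _, _, point in trace]
```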

2.2 Success Criteria – Success is defined by user‑experience metrics reported by devices; simple HTTP status codes are insufficient because partial successes may still impact users. Metrics are used to decide whether a request caused user‑visible degradation.
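The point that status codes are insufficient can be captured in a small sketch. The metric names here are assumptions standing in for whatever counters devices stream back during an experiment window:

```python
def user_visible_success(device_metrics):
    """Judge an experiment by what devices reported, not by HTTP status.
    `device_metrics` is a hypothetical dict of counters collected from
    clients during the experiment window."""
    played = device_metrics.get("playback_starts", 0)
    errors = device_metrics.get("user_visible_errors", 0)
    # An HTTP 200 followed by a blank screen still counts as a failure.
    return errors == 0 and played > 0
```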

2.3 Idempotent Operations – To safely replay requests, Netflix groups them into equivalence classes based on path, parameters, and device information. The team explored machine learning to map requests to classes but currently focuses on Falcor‑generated requests.
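One way to sketch such an equivalence class is to key on the request path, the parameter names, and the device type, so that one experiment's result can stand in for every request in the same class. The specific key choice below is an assumption; notably, parameter *names* go into the key while their values do not:

```python
import hashlib

def request_class(path: str, params: dict, device: str) -> str:
    """Map a request to an equivalence-class identifier. Requests that
    differ only in parameter values land in the same class; a different
    path, parameter set, or device yields a different class."""
    key = "|".join([path, ",".join(sorted(params)), device])
    return hashlib.sha256(key.encode()).hexdigest()[:12]
```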

User Impact – Experiments are limited in scope (20‑30 seconds, affecting ≤10 users) to keep risk low. Failure detection thresholds (e.g., >75 % failure rate) are used to flag real faults while filtering false positives. Even with aggressive experiment rates, the daily user impact remains negligible.
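The thresholding step is simple enough to sketch directly; the 75 % figure comes from the article, while the function shape is an assumption:

```python
def flag_fault(failed: int, total: int, threshold: float = 0.75) -> bool:
    """Flag a fault combination as real only if the failure rate over
    the short experiment window clears the threshold; isolated failures
    (network blips, unrelated noise) are filtered out as false positives."""
    if total == 0:
        return False
    return failed / total > threshold
```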

Results – The prototype successfully explored the massive fault space of the critical “App Boot” request (≈2^100 possible fault combinations) with only ~200 experiments, uncovering five distinct failure scenarios, including one triggered only by a combination of fault points. Detected faults still require manual remediation.

Netflix plans to scale the system to automatically search larger request spaces and pre‑emptively fix user‑impacting faults before they occur.

Tags: automated testing, Chaos Engineering, Reliability, Fault Injection, Netflix, Molly algorithm
Written by DevOps, which shares content and events on trends, applications, and practices in development efficiency, AI, and related technologies. The IDCF (International DevOps Coach Federation) trains end‑to‑end development‑efficiency talent, linking high‑performance organizations and individuals.