Test In Production (TIP): Microsoft’s Shift‑Right Testing, Fault Injection, and Chaos Engineering Practices

The article explains Microsoft’s Test‑In‑Production (TIP) approach, describing why production is the only true environment, how they use gradual releases, feature flags, telemetry, fault injection, circuit‑breaker testing, and chaos engineering to improve reliability, micro‑service compatibility, and business continuity.

Continuous Delivery 2.0

Jan 6, 2021

Production Environment Is Unique and Cannot Be Replicated

Microsoft reduced its reliance on functional testing in offline labs in favor of unit testing, and shifted further testing into the production environment, because no test environment can match the breadth, diversity, and constantly changing nature of production workloads.

Laboratory environments cannot faithfully reproduce production scale, real customer load, or the continuous evolution of configurations and infrastructure, so Microsoft has increasingly performed work directly in production.

What We Did

At Microsoft, TIP (Test In Production) refers to two practices: (1) a set of safeguards that protect the production environment, and (2) a set of validations that continuously verify the health and quality of the ever‑changing production system.

To protect production, code changes are rolled out gradually and in a controlled way through staged (canary-style) deployment backed by feature flags.

Feature flags: https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/progressive-experimentation-feature-flags
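One common way to combine a gradual rollout with a feature flag is to hash each user into a stable bucket and enable the flag only for buckets below the current rollout percentage. The sketch below illustrates that idea only; the flag name `new-checkout`, the user IDs, and both function names are hypothetical and are not taken from Microsoft's implementation.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(flag_name, user_id) < rollout_percent


# Hypothetical population used only to illustrate widening a rollout.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if is_enabled("new-checkout", u, 10)}
at_50 = {u for u in users if is_enabled("new-checkout", u, 50)}
```

Because the bucket depends only on the flag name and the user ID, widening the rollout from 10% to 50% keeps every already-enabled user enabled, and setting the percentage to 0 switches the change off for everyone without a redeploy.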

Telemetry provides real-world test data from actual customer load, capturing failures, anomalies, performance metrics, and security events. L3 tests, Microsoft's label for functional tests that run against production, execute in production accounts.

Fault Injection and Chaos Engineering

Fault injection and chaos engineering are used to observe system behavior under failure, verify that resilience mechanisms work, and ensure that faults remain isolated to the affected subsystem rather than cascading.

Resilience mechanisms: https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/patterns-resiliency-cloud

Fault Testing of Circuit Breakers

Using fault injection, we test circuit breakers directly in production because they are hard to validate elsewhere. Two key questions are examined:

When the circuit breaker opens, does the fallback behavior work in production as it does in unit tests?

When a circuit breaker should open, does it actually open, and is the sensitivity threshold configured correctly?

Example: Testing a circuit breaker on Redis Cache

Redis is a non‑critical distributed cache; if it fails, the system should fall back to the original data source. The test forces the circuit breaker open by changing configuration and verifies that calls are redirected to SQL, then reverses the configuration to ensure calls return to Redis.

The test confirms that the fallback works when the breaker opens, and it helps validate the breaker's sensitivity threshold and timeout settings. This behavior is hard to verify in a lab, which is why the test relies on fault injection in production.
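The Redis test described above can be sketched with a small circuit breaker that exposes a configuration switch for forcing it open. This is a simplified illustration under stated assumptions, not Microsoft's implementation: the `CircuitBreaker` class, the `force_open` switch, and the stand-in functions `read_from_cache` and `read_from_sql` are all hypothetical, and the breaker omits a half-open state for brevity.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal circuit breaker with a fault-injection switch (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # sensitivity: failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long the breaker stays open
        self.clock = clock
        self.force_open = False                     # hypothetical config flag for fault injection
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.force_open:
            return True
        if self._opened_at is None:
            return False
        if self.clock() - self._opened_at >= self.reset_timeout_s:
            # Timeout elapsed: close the breaker and let traffic through again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            return fallback()
        self._failures = 0
        return result


def read_from_cache() -> str:   # stand-in for a Redis read
    return "value-from-cache"


def read_from_sql() -> str:     # stand-in for the original data source
    return "value-from-sql"


breaker = CircuitBreaker()
normal = breaker.call(read_from_cache, read_from_sql)    # served by the cache
breaker.force_open = True                                 # inject the fault via configuration
injected = breaker.call(read_from_cache, read_from_sql)  # redirected to SQL
breaker.force_open = False                                # revert the configuration
restored = breaker.call(read_from_cache, read_from_sql)  # back on the cache
```

Forcing the breaker open answers the first question (does the fallback work?), while `failure_threshold` and `reset_timeout_s` are the kind of sensitivity settings the second question is about.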

What We Learned from Fault Injection

Chaos experiments should run first in a canary environment, which is the first stage of the gradual rollout. Our own engineering support systems run there, so any failure affects only us, not customers.

Automating fault-injection experiments is important: running them by hand is costly, and because the system keeps evolving, they have to be repeated regularly.
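An automated experiment usually follows the same shape each time: check that the system is healthy, inject the fault, check health again while the fault is active, then always revert. The harness below is a minimal sketch of that loop; `run_experiment`, `injected_fault`, `ExperimentResult`, and the toy `system` dictionary are hypothetical names chosen for illustration, not part of any Microsoft tooling.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class ExperimentResult:
    name: str
    healthy_before: bool
    healthy_during: Optional[bool]  # None means the fault was never injected
    healthy_after: Optional[bool]

    @property
    def passed(self) -> bool:
        checks = (self.healthy_before, self.healthy_during, self.healthy_after)
        return all(value is True for value in checks)


@contextmanager
def injected_fault(inject: Callable[[], None], revert: Callable[[], None]) -> Iterator[None]:
    """Inject a fault and guarantee it is reverted, even if a check raises."""
    inject()
    try:
        yield
    finally:
        revert()


def run_experiment(name: str, inject: Callable[[], None], revert: Callable[[], None],
                   steady_state: Callable[[], bool]) -> ExperimentResult:
    healthy_before = steady_state()
    if not healthy_before:
        # Never inject a fault into a system that is already unhealthy.
        return ExperimentResult(name, False, None, None)
    with injected_fault(inject, revert):
        healthy_during = steady_state()
    healthy_after = steady_state()
    return ExperimentResult(name, healthy_before, healthy_during, healthy_after)


# Toy system: reads fall back to SQL when the cache is down.
system = {"cache_up": True}


def read_value() -> str:
    return "from-cache" if system["cache_up"] else "from-sql"


result = run_experiment(
    "cache-outage",
    inject=lambda: system.update(cache_up=False),
    revert=lambda: system.update(cache_up=True),
    steady_state=lambda: read_value() in ("from-cache", "from-sql"),
)
```

Keeping the revert in a `finally` block is the important design choice here: an experiment that fails halfway should not leave the fault in place.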

Business Continuity and Disaster Recovery

We maintain failover plans for all services and subsystems. These cover impact assessment, dependency mapping, business continuity design, formal disaster-recovery documentation, and regular recovery drills.

Microservice Compatibility

With over 30 independently deployed microservices, we use a compatibility test suite as part of rolling CI. Because the combinatorial explosion of versions makes exhaustive testing impossible, we rely on production L3 tests to verify compatibility when a service is upgraded.
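A rough calculation shows why exhaustive version testing is out of reach. The numbers below are assumptions for illustration: exactly 30 services (the text says "over 30") and only two versions of each service live at once, as happens mid-rollout. The function name is hypothetical.

```python
def version_combinations(service_count: int, versions_per_service: int) -> int:
    """Number of distinct system-wide version combinations."""
    return versions_per_service ** service_count


# Assumed: 30 services, each with an old and a new version deployed somewhere.
both_versions_live = version_combinations(30, 2)  # 1_073_741_824 combinations
```

Even under these conservative assumptions there are more than a billion combinations, so checking compatibility in production at upgrade time is the practical alternative.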

Key Takeaways

Focus on Building a Fast and Reliable Quality Signal

The quality signal must be quick and trustworthy across dev boxes, main branches, and release branches, giving engineers confidence to ship changes. A slow or unreliable signal stalls the pipeline and magnifies fragility.

Combined Engineering Drives Greater Ownership

Combined engineering, in which the same engineers both build and test what they ship, fosters a culture of end-to-end responsibility. It reduces hand-offs and increases team agility while still valuing specialization.

Deploying to Production Counts as 50% Completion

Shipping to production is only half the job; the other half is ensuring quality under real load. Because production never stays static, testing in the form of monitoring, fault injection, failover drills, and other practices is an ongoing effort.
