LinkedIn Redliner: Automated Capacity Planning and Performance Testing in Production
The article explains how LinkedIn’s Redliner system automatically measures service capacity and performs low‑impact, production‑traffic stress tests to identify bottlenecks, guide resource allocation, and support proactive capacity planning and performance regression testing.
LinkedIn operates hundreds of internal services for over 467 million users. Its engineers frequently face questions such as: what is the maximum QPS a service can sustain, can it handle 150% of peak traffic, and which infrastructure resource (for example CPU or I/O) is the bottleneck?
To provide precise answers, the performance team needed a solution that could evaluate capacity using real production traffic, minimize impact on user experience, reduce operational costs, and automate scaling.
Redliner is this solution: it conducts automated, incremental stress tests in production, gradually increasing traffic to a target service until it can no longer handle additional load, thereby determining the service’s throughput limit and its capacity headroom.
Redliner is designed around two core principles: low impact on production and full automation. It redirects traffic gradually, monitors health via the EKG system, and adjusts load based on real‑time performance metrics, also assessing downstream effects.
Low impact is achieved by incrementally adding traffic, continuously checking service health, and stopping or reducing load if any health rule is violated.
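The incremental ramp described above can be sketched as a simple loop. This is a minimal illustration, not Redliner's actual implementation: the function names, the step size, the 90 ms latency rule, and the simulated latency curve are all assumptions made for the example.

```python
from typing import Callable, Optional


def find_redline(
    is_healthy_at: Callable[[int], bool],
    start_qps: int,
    step_qps: int,
    max_qps: int,
) -> Optional[int]:
    """Ramp load in small steps and return the last QPS that stayed healthy.

    `is_healthy_at` is a hypothetical callback that applies the given load
    and reports whether every health rule still holds.
    """
    last_healthy = None
    qps = start_qps
    while qps <= max_qps:
        if not is_healthy_at(qps):
            # A health rule was violated: stop ramping so the load can be
            # reduced back to the last healthy level.
            break
        last_healthy = qps
        qps += step_qps
    return last_healthy


def simulated_latency_ms(qps: int, capacity_qps: float = 1000.0,
                         base_ms: float = 20.0) -> float:
    """Toy latency model (assumption): latency grows sharply near capacity."""
    if qps >= capacity_qps:
        return float("inf")
    return base_ms / (1.0 - qps / capacity_qps)


# Hypothetical health rule: latency must stay at or below 90 ms.
redline = find_redline(
    is_healthy_at=lambda qps: simulated_latency_ms(qps) <= 90.0,
    start_qps=100,
    step_qps=50,
    max_qps=2000,
)
print(redline)  # 750 for this toy model
```

Smaller steps give a more precise redline and less risk per increment, at the cost of a longer test.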
Full automation replaces manual testing by automatically launching tests, evaluating throughput, checking for performance degradation alerts, and gracefully stopping or rolling back when issues arise. Tests typically finish within an hour and produce reports that highlight latency changes and resource bottlenecks.
The Redliner architecture consists of three main components: the traffic‑shifting layer (proxy/load balancer), the service health analyzer, and the service data collector.
The traffic‑shifting layer works only with stateless services, using LinkedIn’s LiX system to re‑route traffic to the Service Under Test (SUT) without affecting other instances.
The data collector gathers real‑time metrics (QPS, latency, error rate, CPU/memory usage) from Autometrics, a push‑based metrics collection system.
The health analyzer uses EKG to evaluate these metrics against predefined health rules, allowing Redliner to compare normal and test traffic conditions.
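To make the idea of health rules concrete, the sketch below compares metrics under test traffic against a baseline taken under normal traffic. It is not EKG's rule format or API: the `Metrics` and `HealthRule` types, the field names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Metrics:
    latency_p95_ms: float
    error_rate: float
    cpu_util: float


@dataclass(frozen=True)
class HealthRule:
    name: str
    metric: str                            # attribute name on Metrics
    max_absolute: Optional[float] = None   # hard ceiling for the metric
    max_ratio: Optional[float] = None      # ceiling relative to the baseline


def violations(rules: List[HealthRule], baseline: Metrics,
               test: Metrics) -> List[str]:
    """Return the names of the rules that the test-traffic metrics break."""
    broken = []
    for rule in rules:
        value = getattr(test, rule.metric)
        base = getattr(baseline, rule.metric)
        if rule.max_absolute is not None and value > rule.max_absolute:
            broken.append(rule.name)
        elif (rule.max_ratio is not None and base > 0
              and value > base * rule.max_ratio):
            broken.append(rule.name)
    return broken


# Hypothetical rules and measurements, for illustration only.
rules = [
    HealthRule("latency_vs_baseline", "latency_p95_ms", max_ratio=1.5),
    HealthRule("error_rate_ceiling", "error_rate", max_absolute=0.01),
    HealthRule("cpu_ceiling", "cpu_util", max_absolute=0.85),
]
baseline = Metrics(latency_p95_ms=40.0, error_rate=0.001, cpu_util=0.35)
under_test = Metrics(latency_p95_ms=70.0, error_rate=0.001, cpu_util=0.60)

print(violations(rules, baseline, under_test))  # ['latency_vs_baseline']
```

Relative rules catch degradation against normal behavior, while absolute ceilings guard resources such as CPU regardless of the baseline.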
In practice, Redliner iteratively increases load, monitors health, and stops when the service reaches its redline – the maximum sustainable QPS. Figures illustrate inbound and system metric rules, as well as test results showing QPS and latency over time.
Use cases include reducing data‑center overhead by identifying over‑provisioned services, proactive capacity planning with alerts for potential bottlenecks, and performance regression testing across different service versions.
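Once a per-instance redline is known, these use cases reduce to simple arithmetic. The sketch below is illustrative only: the function names, the 20% safety margin, the 5% regression tolerance, and all the numbers are assumptions, not values from LinkedIn.

```python
import math


def instances_needed(peak_qps: float, redline_qps_per_instance: float,
                     traffic_multiplier: float = 1.5,
                     safety_margin: float = 0.2) -> int:
    """Instances required to serve a multiple of peak traffic.

    Each instance is planned to run below its redline by `safety_margin`.
    """
    usable_qps = redline_qps_per_instance * (1.0 - safety_margin)
    return math.ceil(peak_qps * traffic_multiplier / usable_qps)


def is_regression(baseline_redline_qps: float, candidate_redline_qps: float,
                  tolerance: float = 0.05) -> bool:
    """Flag a new version whose redline dropped by more than `tolerance`."""
    return candidate_redline_qps < baseline_redline_qps * (1.0 - tolerance)


# Hypothetical service: 12,000 QPS at peak, 800 QPS redline per instance,
# currently running on 40 instances.
needed = instances_needed(peak_qps=12_000, redline_qps_per_instance=800)
reclaimable = 40 - needed
print(needed, reclaimable)       # 29 11

# Hypothetical redlines measured for two versions of the same service.
print(is_regression(800, 700))   # True: more than a 5% drop
print(is_regression(800, 780))   # False: within tolerance
```

A positive `reclaimable` count suggests over-provisioning, a negative one signals a capacity shortfall for the chosen traffic multiplier.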
Redliner has helped LinkedIn reclaim resources, predict future capacity needs, and prevent releases with performance regressions.
The system was built through a cross‑team effort, with contributions from many engineers. The original English article can be found on LinkedIn’s engineering blog.
References:
1. https://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
2. https://engineering.linkedin.com/52/autometrics-self-service-metrics-collection
3. https://engineering.linkedin.com/blog/2015/11/monitoring-the-pulse-of-linkedin
4. https://engineering.linkedin.com/blog/2016/04/faster-and-easier-service-deployment-with-lps--our-new-private-c
5. Original English article: https://engineering.linkedin.com/blog/2017/02/redliner--how-linkedin-determines-the-capacity-limits-of-its-ser
High Availability Architecture
Official account for High Availability Architecture.