
Throttling Pattern for Cloud Applications: Managing Resource Consumption and SLA Compliance

The article explains how to use throttling together with auto‑scaling to control resource consumption of cloud applications, prevent tenant overload, handle traffic bursts, and ensure service‑level agreements while optimizing costs.

Architects Research Society

Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service. This allows the system to continue functioning and to meet service level agreements, even when an increase in demand places an extreme load on resources.

Background and Problem

The load on a cloud application typically varies over time, depending on the number of active users and the types of activities they perform. Sudden, unanticipated bursts of activity can also occur. If the processing demand exceeds the capacity of the available resources, performance degrades and the system may fail, which is unacceptable when a service level agreement (SLA) must be honored.

Many strategies exist to handle different loads in the cloud depending on business goals. One strategy is auto‑scaling, which matches supplied resources to user demand at any given time, helping to meet demand while optimizing cost. However, auto‑scaling is not instantaneous; rapid demand spikes can create a short window of resource shortage.

Solution

An alternative strategy to auto‑scaling is to allow an application to use resources only up to a defined limit, and to throttle it when that limit is reached. The system should monitor resource usage so that, when consumption exceeds a threshold, it can throttle requests from one or more users. This keeps the system operational and helps it meet any SLA in place. See the Instrumentation and Telemetry Guidance for more on monitoring resource usage.

The system can implement several throttling strategies, including:

Reject requests from an individual user who has already exceeded n API calls within a given time window. This requires the system to meter resource usage for each tenant or user (see the Service Metering Guidance).

Disable or downgrade selected non‑essential services so that essential services can run unimpeded when resources are scarce (e.g., switch video streaming to lower resolution).

Use load leveling to smooth the volume of activity (the Queue‑Based Load‑Balancing pattern, listed under Related Guidance, covers this approach in more detail). In a multi‑tenant environment this can reduce per‑tenant performance, so high‑value tenants can be prioritized while lower‑priority requests are delayed until the backlog eases.

Delay operations belonging to lower‑priority applications or tenants; the system can generate an exception notifying the tenant that the system is busy and the operation should be retried later.
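As an illustration, the first strategy above (rejecting calls beyond a per‑user rate) is often implemented with a token‑bucket algorithm. The sketch below is a minimal in‑memory version; the class and method names are illustrative, and a production system would typically keep the counters in a shared store so that every application instance sees the same state.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token bucket: allows up to `rate` calls per second,
    with short bursts of up to `capacity` calls."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = max(0.0, now - self.last[key])
        self.last[key] = now
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens[key] = min(self.capacity,
                               self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1.0:
            self.tokens[key] -= 1.0   # spend one token for this request
            return True
        return False                  # over the limit; reject or delay
```

A caller would invoke `allow(tenant_id)` on each incoming request and return a throttling error whenever it yields `False`.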

The diagram below shows an area chart of resource usage (memory, CPU, bandwidth, etc.) over time for an application using three features (A, B, C). Each feature is a region representing a component that performs a specific set of tasks.

The area directly under a feature’s line indicates the resources used when that feature is invoked. The region between the lines of features A and B represents the combined resource usage of applications calling feature B, and the aggregated area for each feature shows the total system resource consumption.

The next figure illustrates the effect of throttling as demand approaches the resource limit. Before time T1, total allocated resources reach the soft limit, creating a risk of exhaustion. Feature B, being less critical than A or C, is temporarily disabled and its resources are released. Between T1 and T2, applications using features A and C continue to run normally. Eventually, resource usage for A and C drops enough to re‑enable feature B at time T2.

Auto‑scaling and throttling can also be combined to keep applications responsive within SLA bounds. If demand is expected to stay high, throttling provides a temporary solution while the system scales out; once additional capacity is available, full functionality can be restored.
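The combination described above can be sketched as a simple control loop. The `Scaler` and `Throttle` classes below are placeholders of my own naming (a real scaler would call the cloud provider's API, which takes time to complete); the point is only that throttling provides immediate relief while scale‑out is in flight.

```python
class Scaler:
    """Placeholder for an auto-scaler; a real implementation would
    call the cloud provider's scaling API (which is not instant)."""
    def __init__(self):
        self.instances = 1
    def request_scale_out(self):
        self.instances += 1

class Throttle:
    """Gate that rejects or degrades non-essential work while active."""
    def __init__(self):
        self.active = False

SOFT_LIMIT = 0.80  # fraction of capacity that triggers throttling

def handle_load(utilization: float, scaler: Scaler, throttle: Throttle) -> None:
    """Throttle immediately for relief and request scale-out in
    parallel; lift throttling once utilization falls back below
    the soft limit (e.g., after new capacity comes online)."""
    if utilization > SOFT_LIMIT:
        throttle.active = True
        scaler.request_scale_out()
    else:
        throttle.active = False
```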

Issues and Considerations

Throttling decisions affect overall system architecture and should be considered early in the application design process.

The throttling mechanism must react quickly to traffic spikes and recover swiftly when load decreases, requiring continuous performance data collection.

If a service temporarily rejects user requests, it should return a specific error code so client applications understand the reason and can retry after a delay.
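On the client side, that contract typically means honoring HTTP status 429 (Too Many Requests) and its Retry-After header. The sketch below assumes the service follows that convention; `do_request` is a hypothetical callable returning `(status, headers, body)`, so the retry logic stays independent of any particular HTTP library.

```python
import time

def retry_delay(status, retry_after):
    """Return seconds to wait before retrying, or None if the
    response does not indicate throttling."""
    if status == 429:  # Too Many Requests
        return float(retry_after) if retry_after else 1.0
    return None

def call_with_retry(do_request, max_attempts=3):
    """Invoke do_request(), retrying with the server-suggested
    delay while the service reports it is throttling."""
    for attempt in range(max_attempts):
        status, headers, body = do_request()
        delay = retry_delay(status, headers.get("Retry-After"))
        if delay is None:
            return status, body     # success or a non-throttling error
        if attempt < max_attempts - 1:
            time.sleep(delay)       # back off as the service requested
    return status, body             # still throttled after all attempts
```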

When auto‑scaling, throttling can serve as a temporary measure; for sudden bursts that are not expected to persist, throttling may be cheaper than scaling.

If demand grows extremely fast, throttling alone may not keep the system alive; in such cases consider larger capacity reserves or more aggressive auto‑scaling policies.

When to Use This Pattern

Use this pattern to:

Ensure the system continues to meet service level agreements.

Prevent a single tenant from monopolizing resources.

Handle bursts of activity.

Help optimize system cost by keeping resource usage within a maximum level needed for operation.

Example

The final diagram shows how throttling is implemented in a multi‑tenant system where users from each tenant fill out and submit surveys. The application monitors the request rate from each tenant and limits the number of requests per second a tenant can submit. Requests exceeding the limit are blocked.
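A per‑tenant limit like the one described can be sketched with a sliding window over recent request timestamps. The names here are illustrative, and a real multi‑instance deployment would keep the counts in a shared cache rather than in process memory:

```python
import time
from collections import defaultdict, deque

class TenantRateLimiter:
    """Allow each tenant at most `max_requests` submissions within
    the trailing `window` seconds; excess requests are blocked."""

    def __init__(self, max_requests: int, window: float = 1.0):
        self.max_requests = max_requests
        self.window = window
        self.history = defaultdict(deque)  # tenant id -> timestamps

    def try_submit(self, tenant_id: str) -> bool:
        now = time.monotonic()
        timestamps = self.history[tenant_id]
        # Discard requests that have aged out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # blocked; the tenant should retry later
        timestamps.append(now)
        return True
```

Each survey submission would pass through `try_submit`, and a `False` result maps to the "request blocked" path in the diagram.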

Next Steps

When implementing this pattern, the following guidance may also be relevant:

Instrumentation and Telemetry Guidance – describes how to generate and capture custom monitoring information.

Service Metering Guidance – explains how to meter service usage to understand how services are used, which informs throttling decisions.

Auto‑Scaling Guidance – contains information about auto‑scaling strategies and how throttling can be used as a temporary measure or to eliminate the need for auto‑scaling.

Related Guidance

Other patterns that may be relevant when implementing this pattern include:

Queue‑Based Load‑Balancing – a common mechanism for implementing throttling, where a queue buffers requests to smooth traffic.

Priority Queue Pattern – using priority queues as part of throttling to maintain performance for critical or high‑value applications while reducing performance for less important ones.

Tags: resource management, SLA, multi-tenant, cloud, auto-scaling, throttling
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.
