Load Shedding with NGINX using adaptive concurrency control — Part 2

Vikas Kumar · OLX Engineering · Feb 25, 2021

--

Author’s Note: This is a two-part series about how we implemented and operationalized load shedding in our services using adaptive concurrency control with NGINX. Part 1 sets the context and background, and Part 2 focuses on the implementation.

I wrote about this technique in much more detail last year (Part 1, 2, 3, and 4). Some of the content here is adapted from those posts. Although I have tried to make this post self-sufficient, if you want to dive deeper into this topic, I’d recommend going through those four posts and their attached reference material.

Here is the link to part 1.

As noted in the previous post, we deploy NGINX as a reverse proxy sidecar with most of our services. NGINX already ships with a static concurrency control feature — the ngx_http_limit_conn_module can limit the number of connections, which roughly translates to the total number of active requests to the upstream service (in the case of HTTP/2, each concurrent request is counted as a separate connection); a minimal example of this static setup follows the quote below. Adaptive concurrency control, however, is not supported out of the box. Enter OpenResty. In their own words:

OpenResty® is a dynamic web platform based on NGINX and LuaJIT.
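Before moving on, here is roughly what that static setup looks like with stock NGINX (a sketch; the zone name and the limit of 100 are illustrative values, not our production configuration):

http {
    # one shared zone tracking concurrent connections per server name
    limit_conn_zone $server_name zone=per_server:10m;

    server {
        location / {
            # allow at most 100 concurrent connections; reject the rest with a 503
            limit_conn per_server 100;
            limit_conn_status 503;
            proxy_pass http://localhost:8080/;
        }
    }
}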

With OpenResty, we can extend NGINX functionality using Lua scripting. We already had the algorithms from Netflix’s concurrency limits library; all we needed was a way to implement them in OpenResty with Lua. Thankfully, OpenResty publishes a lot of community-contributed modules and packages through the OpenResty Package Manager. This is where I discovered the lua-resty-limit-traffic library, which had already implemented static concurrency limiting. This library ended up being the foundation for implementing adaptive concurrency limiting (thanks to agentzh for the library and for OpenResty).

How OpenResty Extends NGINX

NGINX processes HTTP requests in multiple phases, and each phase can have zero or more handlers. With OpenResty, we can hook our custom Lua code into these phases. Here is a depiction of the hooks provided by OpenResty for the different request phases:

Image Credit: https://dev.to/satrobit/extend-nginx-with-lua-ddos-mitigation-using-cookie-validation-5bj9
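As a minimal illustration of these hooks (the handlers and header below are placeholders, not part of our actual setup), a location block can attach Lua code to the access, header filter, and log phases like this:

location / {
    access_by_lua_block {
        -- runs before the request is proxied: a good place for auth or
        -- rate/concurrency checks
        ngx.log(ngx.INFO, "incoming request: ", ngx.var.request_uri)
    }

    proxy_pass http://localhost:8080/;

    header_filter_by_lua_block {
        -- runs while sending response headers back to the client
        ngx.header["X-Served-By"] = "openresty"
    }

    log_by_lua_block {
        -- runs after the response has been sent: metrics and bookkeeping
        ngx.log(ngx.INFO, "request took ", ngx.var.request_time, "s")
    }
}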

lua-resty-limit-traffic Library

Here is how the lua-resty-limit-traffic library works at a very high level:

Define a shared dictionary to store the number of concurrent requests

lua_shared_dict allows you to create a shared memory zone to save state across requests. This memory zone is shared by all NGINX workers and provides thread-safe methods to store and retrieve data. This dictionary will be used to keep track of concurrent requests, which we’ll refer to as In-Flight Requests (IFR).

lua_shared_dict my_limit_conn_store 100m;

Define the max concurrency

We need to define the maximum allowed concurrency (let’s call it maxIFR). If IFR reaches this value, further requests will be rejected with a 503 status.

Handler for incoming requests

The handler for incoming requests is defined in access_by_lua_block. Here, we first check the current IFR by retrieving it from the shared dictionary. If it has reached maxIFR, we reject the request with a 503 status; otherwise, we allow the request and increment the IFR value in the dictionary.

Handler for completed requests

This handler is defined in log_by_lua_block. Here, we decrement the IFR in the shared dictionary.
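Putting the two handlers together, here is a minimal sketch of the idea using the my_limit_conn_store dictionary defined above (simplified, not the library’s actual code; the limit of 100 and the key names are illustrative):

location / {
    access_by_lua_block {
        local dict = ngx.shared.my_limit_conn_store
        local max_ifr = 100  -- statically defined maximum concurrency
        -- atomically increment the in-flight request (IFR) counter
        local ifr = dict:incr("ifr", 1, 0)
        if ifr > max_ifr then
            dict:incr("ifr", -1)
            return ngx.exit(ngx.HTTP_SERVICE_UNAVAILABLE)  -- reject with 503
        end
        ngx.ctx.counted = true  -- remember that this request holds a slot
    }

    proxy_pass http://localhost:8080/;

    log_by_lua_block {
        -- the request is complete: release its slot
        if ngx.ctx.counted then
            ngx.shared.my_limit_conn_store:incr("ifr", -1)
        end
    }
}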

The library implements some more functionality, such as bursting, but that’s not particularly relevant to this post.

Implementing Adaptive Concurrency Limiting

To make concurrency limiting adaptive, we decided to implement the Additive Increase/Multiplicative Decrease (AIMD) and Gradient algorithms from the concurrency limits library.

AIMD Algorithm

AIMD is a simple algorithm that works by observing the latency of the upstream service and periodically adjusting the concurrency limit by comparing that latency with a pre-defined maximum acceptable latency (Lmax). As long as the observed latency stays within Lmax, the concurrency limit is increased additively (incremented by 1). If the observed latency goes beyond Lmax, we decrease the concurrency limit multiplicatively (multiply by x, where 0 < x < 1).

The implementation can be broken down into three pieces:

Handling Incoming Requests

This is quite similar to the equivalent phase in the lua-resty-limit-traffic library, except that the concurrency limit (maxIFR) is not statically defined; it is stored and updated in the shared memory zone.

During initialization, we configure the following parameters:

  • initialConcurrencyLimit: Concurrency limit to start with
  • maxConcurrencyLimit: Maximum concurrency limit
  • minConcurrencyLimit: Minimum concurrency limit

Initially, maxIFR is set to initialConcurrencyLimit, which is then periodically updated by the algorithm. During the check, the latest value is fetched from the memory zone and compared with the current IFR value.
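Illustratively, the access-phase check from the earlier sketch changes only in where maxIFR comes from (the key names and the fallback value are assumptions, not the module’s actual code):

access_by_lua_block {
    local dict = ngx.shared.my_limit_conn_store
    -- the limit now lives in shared memory and is updated periodically by the algorithm
    local max_ifr = dict:get("limit") or 10  -- fall back to initialConcurrencyLimit
    local ifr = dict:incr("ifr", 1, 0)
    if ifr > max_ifr then
        dict:incr("ifr", -1)
        return ngx.exit(ngx.HTTP_SERVICE_UNAVAILABLE)  -- 503
    end
    ngx.ctx.counted = true
}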

Handling Completed Requests

When a request is completed, we decrement the IFR. Additionally, we record the latency of this request in a circular buffer, which is later used by the limit adjustment mechanism.
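A simplified sketch of this handler follows; the actual module uses a circular buffer, but here a shared-dict list stands in for it, and the key names are illustrative:

log_by_lua_block {
    local dict = ngx.shared.my_limit_conn_store
    if ngx.ctx.counted then
        dict:incr("ifr", -1)  -- release this request's slot
    end
    -- record this request's upstream latency (in seconds) for the current window
    local latency = tonumber(ngx.var.upstream_response_time)
    if latency then
        dict:rpush("latency_samples", latency)
    end
}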

Adjusting Concurrency Limit

This mechanism is the heart of the module. For AIMD, we need the following:

  • Current concurrency limit
  • Observed latency
  • Max acceptable latency

We start with initialConcurrencyLimit and periodically (at a frequency defined by the windowSize parameter) adjust the limit by comparing the observed latency with the maximum acceptable latency. During initialization, we start a timer that fires every windowSize seconds and triggers the logic to adjust the concurrency limit.

The algorithm is rather simple:

  • If the observed latency is less than the maximum acceptable latency, increment the concurrency limit by 1, up to maxConcurrencyLimit
  • If the observed latency is greater than the maximum acceptable latency, reduce the concurrency limit by multiplying it by backoffRatio. backoffRatio is configurable and must be ≥ 0.5 and < 1.
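Roughly, the periodic adjustment can be sketched as follows (illustrative only; the parameter names mirror the configuration described above, and compute_observed_latency stands in for the window aggregation described in the next section):

-- scheduled via ngx.timer.every(window_size, adjust_limit) from init_worker_by_lua_block
local function adjust_limit(premature)
    if premature then
        return  -- the worker is shutting down
    end
    local dict = ngx.shared.my_limit_conn_store
    local limit = dict:get("limit") or initial_concurrency_limit
    local observed = compute_observed_latency()  -- average or Nth percentile of the window
    if not observed then
        return  -- fewer than minRequests samples in this window; keep the limit as is
    end
    if observed > max_acceptable_latency then
        -- multiplicative decrease
        limit = math.max(min_concurrency_limit, math.floor(limit * backoff_ratio))
    else
        -- additive increase
        limit = math.min(max_concurrency_limit, limit + 1)
    end
    dict:set("limit", limit)
end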

Measuring observed latency

Observed latency from the upstream service is measured by collecting the latency samples for a defined duration (say 5 seconds) and then taking either the average or the Nth percentile (configurable). This duration is also the frequency at which we update the concurrency limit.

The parameters to be configured during initialization are:

  • windowSize: Time window to compute observed latency
  • minRequests: Minimum number of requests that must arrive in the window for us to be able to compute the latency.
  • metric: Metric used to compute the observed latency. It can be either average or percentile.
  • percentileVal: If the metric is percentile, we need to specify a percentile value, e.g., 99, 95, or 90. It must be greater than 50.

To illustrate, consider the following example:

Here, we collect three requests in the first time window, with latencies T1, T2, and T3. At the end of this time window, we update the concurrency limit using an aggregate (average or Nth percentile) of these values, and so on for subsequent windows.
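For illustration, computing the window’s observed latency from the collected samples might look like this (a sketch, not the module’s actual code):

local function observed_latency(samples, metric, percentile_val, min_requests)
    if #samples < min_requests then
        return nil  -- not enough traffic in this window to judge latency
    end
    if metric == "average" then
        local sum = 0
        for _, v in ipairs(samples) do
            sum = sum + v
        end
        return sum / #samples
    end
    -- metric == "percentile": sort the samples and pick the Nth percentile rank
    table.sort(samples)
    local rank = math.ceil(#samples * percentile_val / 100)
    return samples[math.max(rank, 1)]
end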

Gradient Algorithm

The Gradient algorithm differs from AIMD in that we don’t need to define a static maximum acceptable latency. Instead, it adjusts the limit by comparing latencies over a short time window (windowSize, which is also the update frequency) and a long time window.

For the sake of brevity, I’ll not describe the full algorithm here, but you can head to this post for a detailed understanding and dive into the code here and here.
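Very roughly, the core of the update compares the two windows along these lines (a loose sketch of the idea only; see the linked post and code for the actual algorithm, including its smoothing and queue-size details):

local function gradient_adjust(limit, short_window_latency, long_window_latency)
    -- if short-term latency rises above the long-term baseline, the gradient drops
    -- below 1 and shrinks the limit; otherwise the limit is allowed to grow
    local gradient = math.max(0.5, math.min(1.0, long_window_latency / short_window_latency))
    local headroom = 4  -- illustrative allowance for growth
    local new_limit = limit * gradient + headroom
    return math.min(max_concurrency_limit, math.max(min_concurrency_limit, new_limit))
end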

Test Drive

After the plugin was developed, we decided to test it in a local setup where we could artificially simulate overload conditions. We used the same test setup as the one in the previous article to show the effects of queuing.

As we saw previously, with 10 worker threads in the Spring Boot application, if we generate a load of N < 10 requests/sec, we don’t observe excess queuing or increased latencies. But with N > 10, latencies keep increasing monotonically.

Let’s repeat the same experiment (N = 12) with the adaptive concurrency control plugin enabled in NGINX, using the following parameters:

initialConcurrencyLimit: 10
minConcurrencyLimit: 5
maxConcurrencyLimit: 20
Algorithm: AIMD
windowSize: 3 seconds
timeout (max acceptable latency): 1500 ms
percentile: 99th
backoffFactor: 0.8

When we run the load, we start to see 503s, which means throttling is happening. The 99th percentile latency swings between 1 and 2 seconds in a see-saw pattern, which is characteristic of the AIMD algorithm.

The Path To Production

After testing locally, we decided to first try it with a service in our staging environment. We replaced the plain NGINX Docker image with an OpenResty image carrying the Lua modules and the required configuration. Once it was working, we load-tested it using our load generators. We ran into a couple of minor issues but overall felt confident enough to take it to production.

Choosing The Right Configuration

For our production environment, we decided to go with AIMD first since it was simpler. One tricky aspect of AIMD is choosing the right values for minConcurrencyLimit and maxConcurrencyLimit. We needed to know the steady-state concurrency and its variation with traffic patterns.

We didn’t have any monitoring in place for in-flight requests (IFR), so we first deployed OpenResty just to measure IFR, without doing any load shedding. Here’s a simple NGINX config to achieve this:

http {
    lua_package_path "/app/?.lua;;";

    lua_shared_dict prometheus_metrics 1M;

    init_worker_by_lua_block {
        prometheus = require("prometheus").init("prometheus_metrics")
        metric_ifr = prometheus:gauge(
            "nginx_inflight_requests", "Number of in-flight requests")
    }

    include /usr/local/openresty/nginx/conf/mime.types;

    # Expose the collected metrics for Prometheus to scrape
    server {
        listen 9145;
        location /metrics {
            content_by_lua_block {
                prometheus:collect()
            }
        }
    }

    server {
        listen 80;
        server_name XYZ;

        location / {
            # Count the request as in-flight before it is proxied
            access_by_lua_block {
                metric_ifr:inc(1)
            }

            proxy_pass http://localhost:8080/;

            # The request has completed: it is no longer in-flight
            log_by_lua_block {
                metric_ifr:inc(-1)
            }
        }
    }
}

It uses nginx-lua-prometheus to expose a gauge for IFR. After deploying this, we could observe the variations in IFR in our Prometheus monitoring, which helped us choose a sensible concurrency limit range.

Road Ahead

Currently, this module is running with some of our high-throughput services. We want to gradually roll it out to more services, but in a conservative manner. It’s extremely important for us to keep monitoring it and see how it behaves in various load conditions.

Code

The source code (Lua scripts) is available here. The repo also includes some documentation on how to set it up and sample NGINX configuration.

I have used the following third-party libraries in this code:

  • lua-resty-limit-traffic
  • nginx-lua-prometheus

Disclaimer: The code provided in this repo is tested for our use case and is being used in production with a few services. You are free to use it any way you like, but please keep in mind that it isn’t yet extensively battle-tested.

Alternatives

Although there are many open source projects that provide adaptive concurrency control, the one I have explored and am excited about is Envoy’s adaptive concurrency. Envoy is becoming more popular every day and it’s great to see it offering this functionality. Also, there are awesome new features planned by amazing folks working on Envoy. I recommend and encourage you to check it out if you are interested in this topic.

Conclusion

In my opinion, concurrency control and load shedding are indispensable stability mechanisms in a microservice architecture for preventing overload and cascading failures. If we read about how big companies (Google, Amazon, and WeChat, to name a few) go about this, we realize that there is much more to it than described here; for example, rather than shedding load indiscriminately, we can take factors like request priority and tenancy into account. But even a simple mechanism can bring immense benefits in terms of system stability under load, and it’s great to see projects like Envoy making it available to everyone.

Special thanks to Abhishek Mishra (Senior Engineer) for driving the testing and integration with services and Anmol Sachdeva (SRE) and Ankit Kohli (Engineering Manager) for their support.
