Stressed Applications Create Developer Stress

Many of our recent projects have used scaled architecture to accommodate high-intensity usage and long-term uptime under variable loads. These use-case parameters have challenged my perspective on how applications rely on supporting technologies. Any meaningful application interacts with and depends on other technologies to perform real-world tasks. As environment stresses such as high-intensity usage or long-term uptime increase, the network integration of these tools becomes stressed.

Key Application Infrastructure: Database

The most common supporting technology for any application is the database. One way or another, the application needs somewhere to store its information. For applications with high-intensity parameters, the database can be either a monolithic single-point structure or a horizontally scaled integration (think web-scale). Regardless of which solution is chosen, it is vitally important that the application can maintain its connection to the database in the midst of high-end stress (heavy load, long-term uptime, etc.).

In short, we want our application to be able to:

  1. connect to the database
  2. perform all expected operations between itself and the database
  3. maintain connection as best it can when the database is stressed
  4. persist its own processes efficiently when maintaining the database connection is difficult

Put another way, the application has to do its job, but while doing that job, it may have to care for its database when the database is stressed.

Oversimplified Solutions

This is where the knee-jerk response, "Throw more hardware at it!", happens.

There are certainly times when that is the right solution: when degraded performance is observed during normal to medium-high intensity usage or during short-to-medium term support. These conditions are considered the safe zone for the overall application ecosystem, and the application should not experience degraded performance of any kind within them.

The situations that I am addressing are high-end system stress cases, where the usage is unusually high for the application or the uptime is well beyond a normal maintenance schedule. In these cases the application has to go the extra mile to keep running as smoothly as possible by catering to both itself and the other stressed components of the ecosystem. Throwing more hardware at these cases is not frugal for the application's stakeholders. It is better to develop the application robustly so that it handles these situations effectively without creating excessive hardware overhead. These cases are meant to be rare. If the application is experiencing high-end stress cases regularly, it is an indication that the overall usage parameters have moved upwards and that the base architecture of the application should treat this level of usage as normal, not rare. This again would be a case where more hardware would be the right solution.

For this post, I'm addressing rare high-end stress cases specifically.

The Layout

I'm going to present a reasonable layout that some of our recent applications have utilized because it has application and database scaling built into it.

Here's the (simplified) infrastructure:

  1. application hosted on a Docker solution (Swarm, Kubernetes)
  2. Couchbase NoSQL database (scalable)

In this setup, the Docker solution enables the application itself to scale. Using the Docker management tool, the application manager can scale the number of application instances up or down. Couchbase, like other NoSQL solutions, is designed to be horizontally scaled. Hardware can easily be added or expanded to grow its capability to meet the needs of the application. Both the Docker and Couchbase scaling abilities are manual, but they enable us to make an adaptive application ecosystem that can scale up or down in response to the needs of the overall system.

Signs of Stress

The question is, what happens when the database comes under stress? Degraded performance for certain, broken experience in the worst cases. But what are the signs of stress and how can the application work with it?

Each database speaks its own language in terms of not only data-management, but also database state and stress indicators. I chose Couchbase specifically for this post because it is one of the database solutions that I have had experience with managing under stress and I am familiar with many of its unique stress signs.

Couchbase includes codes as part of its errors. These codes are translated slightly differently by each language SDK.
My recent applications are written in NodeJS. The NodeJS SDK throws a few key codes during different stress situations:

  • Code 11: "temporary error". Couchbase is temporarily unable to take requests.
  • Code 16: "network error". This can be a DNS issue or Couchbase going offline. The SDK is reporting that it can't find Couchbase on the network.
  • Code 23: "timed out". Couchbase didn't respond within the allotted time frame.

What we want to do is capture these specific errors and handle them responsibly within the application, attempting to maintain a persistent connection through a stress event or period.
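The vetting step can be sketched as a small helper. This is a minimal illustration that assumes only the three numeric codes listed above; the constant and function names (isConnectionError, needsRecoveryDelay) are hypothetical and are not part of the Couchbase SDK.

```javascript
// Error codes named in this post, as reported by the NodeJS SDK.
const TEMPORARY_ERROR = 11;
const NETWORK_ERROR = 16;
const TIMED_OUT = 23;

const CONNECTION_ERROR_CODES = new Set([TEMPORARY_ERROR, NETWORK_ERROR, TIMED_OUT]);

// Hypothetical helper: true when the error points at the connection
// and should trigger a reconnect attempt.
function isConnectionError(err) {
  return Boolean(err) && CONNECTION_ERROR_CODES.has(err.code);
}

// Hypothetical helper: true when Couchbase should be given a short pause
// before the reconnect attempt (a temporary error).
function needsRecoveryDelay(err) {
  return Boolean(err) && err.code === TEMPORARY_ERROR;
}
```

Any error that fails the isConnectionError() check, such as a "document not found" error, would go straight back to the caller untouched.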

NodeJS Application Specifics

Most of my recent applications have been built on NodeJS. They use a NodeJS-specific SDK and NodeJS tooling to handle application life cycles.
NodeJS handles input/output asynchronously: rather than blocking, it queues requests until the event loop can process them. This helps with degraded performance, replacing a broken experience with a delayed one. When Couchbase performance degrades, this lets us manage the issue rather than abort an application process completely.

Our plan is to delay the application life cycle responsibly to manage the Couchbase performance issue.

The process that I've adopted is as follows:

  1. vet the error type (error code detection)
  2. immediately return all non-connection error types to the application life cycle (example: a "document not found" error)
  3. disconnect from Couchbase if there is a live connection (caveat: most databases, including Couchbase, don't "forget" open connections)
  4. attempt to reconnect to Couchbase a limited number of times (retry pattern described below)
  5. delay a reconnection attempt for a Code 11: "temporary error" (may be a Couchbase re-balance or node warmup, so we wait a short time to let it recover)

The Reconnect Pattern

The reconnect pattern works as follows:

The reconnectBucket() function returns a Promise that resolves to null when the reconnection works or rejects with a custom error when it fails.
The operation variable uses a NodeJS package named "retry" that attempts an algorithm, catches errors, and reattempts the algorithm using settings that determine how many times to repeat and how long to wait between attempts (represented by the reconnectRetrySettings variable). Each failure is passed to operation.retry(err); once the retry limit is reached, the process aborts with the custom error.
The settings in this example indicate that retry will make 10 attempts starting at 1,000 milliseconds (one second), increasing that time by a factor of 2 until it reaches a maximum of 10,000 milliseconds (ten seconds), and it will not randomize those time values.
The delay variable, instantiated inside the algorithm, sets a 500 millisecond delay that the setTimeout() wrapper uses to delay the reconnection attempt after a Code 11 error.
This gives Couchbase some space to recover from whatever temporary situation has thrown the error (usually a re-balance or node warmup).
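A rough, dependency-free sketch of the behavior described above may help. To stay self-contained it swaps the "retry" package for a hand-rolled loop, so it is not the original implementation. The settings object shape, the state holder, and the connect() callback (assumed to return a Promise for a fresh bucket) are hypothetical names introduced for illustration.

```javascript
// Assumed settings, mirroring the description: 10 attempts, waits starting
// at 1,000 ms, doubling up to a 10,000 ms cap, no randomization.
const reconnectRetrySettings = {
  attempts: 10,
  factor: 2,
  minTimeout: 1000,
  maxTimeout: 10000,
};

const CODE_TEMPORARY_ERROR = 11;
const TEMPORARY_ERROR_DELAY_MS = 500;

// Wait (in ms) before retry number `retryIndex` (0-based), capped at maxTimeout.
function backoffDelay(retryIndex, settings) {
  const wait = settings.minTimeout * Math.pow(settings.factor, retryIndex);
  return Math.min(wait, settings.maxTimeout);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// state:      hypothetical holder for the live bucket ({ bucket })
// triggerErr: the error that caused the reconnect
// connect:    hypothetical function returning a Promise for a new bucket
// Resolves to null on success; rejects with a custom error when attempts run out.
async function reconnectBucket(state, triggerErr, connect, options = {}) {
  const settings = options.settings || reconnectRetrySettings;
  const sleep = options.sleep || defaultSleep;

  // Drop any live connection first so it does not linger on the database side.
  if (state.bucket) {
    try {
      state.bucket.disconnect();
    } catch (ignored) {
      // The connection was already unusable; nothing more to do.
    }
    state.bucket = null;
  }

  let lastErr = null;
  for (let attempt = 0; attempt < settings.attempts; attempt += 1) {
    // Back off between attempts (no wait before the first one).
    if (attempt > 0) {
      await sleep(backoffDelay(attempt - 1, settings));
    }
    // Give Couchbase extra room after a temporary error (Code 11).
    if (triggerErr && triggerErr.code === CODE_TEMPORARY_ERROR) {
      await sleep(TEMPORARY_ERROR_DELAY_MS);
    }
    try {
      state.bucket = await connect();
      return null;
    } catch (connectErr) {
      lastErr = connectErr;
    }
  }

  const failure = new Error(
    'Unable to reconnect to Couchbase after ' + settings.attempts + ' attempts'
  );
  failure.cause = lastErr;
  throw failure;
}
```

Passing sleep in through options is a design choice made for this sketch so the waits can be replaced in tests; a service module could just as well keep the bucket in a module-level variable instead of a state object.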

This pattern is incorporated as part of the Couchbase SDK service module we write into our applications. The functions of that module integrate it by calling reconnectBucket() when errors with the predetermined codes are thrown (11, 16, and 23 in this case). They then resume their operations if the reconnectBucket() Promise resolves or bubble the error back up the application's event loop.

Incorporating the Reconnect Pattern

Consider that pattern at work in a simple "get document" function:

The getDoc() function is a generic function used to get a single document from Couchbase based on its document name (docName).
The function is Promisified, returning either the document value (return resolve(couchbaseResponse.value)) or any indicated error from the various error handling steps.
The error handling is where reconnectBucket() gets inserted. If the error returned from bucket.get() has a code of 11, 16, or 23, we call reconnectBucket() to try to fix the connection issue with Couchbase.
As indicated in the above section, reconnectBucket() handles the retry process internally and it will return either a resolved Promise or an error based on how that goes.
If reconnectBucket() resolves in the .then(), getDoc() will reattempt from the top.
If reconnectBucket() throws an error into .catch(), getDoc() will reject that error into its own Promise and bubble that error back to the requesting source of getDoc().
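The flow just described might look roughly like the sketch below. It is an illustration under stated assumptions, not the original function: the state holder and the injected reconnect callback (standing in for reconnectBucket()) are hypothetical, and the bucket is assumed only to expose get(docName, callback) with a callback of the form (err, couchbaseResponse).

```javascript
// Codes that indicate a connection problem rather than a data problem.
const RECONNECT_CODES = [11, 16, 23];

// state:     hypothetical holder for the live bucket ({ bucket })
// docName:   name of the document to fetch
// reconnect: hypothetical function(err) returning a Promise, standing in
//            for reconnectBucket()
function getDoc(state, docName, reconnect) {
  return new Promise((resolve, reject) => {
    state.bucket.get(docName, (err, couchbaseResponse) => {
      if (!err) {
        return resolve(couchbaseResponse.value);
      }
      // Non-connection errors (for example "document not found")
      // go straight back to the caller.
      if (!RECONNECT_CODES.includes(err.code)) {
        return reject(err);
      }
      // Connection errors: try to reconnect, then reattempt from the top.
      return reconnect(err)
        .then(() => getDoc(state, docName, reconnect))
        .then(resolve)
        .catch(reject);
    });
  });
}
```

As written, a reattempt that keeps hitting connection errors after successful reconnects would loop indefinitely, so a cap on the number of reattempts may be worth adding in practice.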

Persistent Applications In the Wild

By using this pattern, we are able to prevent Couchbase performance issues from immediately affecting the application's life cycle adversely. The requests will be delayed, but not canceled.

This reconnect pattern has allowed some of our most recent applications to persist their database connections smoothly through abnormally high-intensity stress events. After experiencing these events, our team analyzed each one to see if it was indicative of reasonably normal activity for the application environment. If it was, we would increase scalability to address the rising need. If it was not, we were grateful for the persistent database connection design for carrying us through the event.

This was just one example of how this pattern can be used in a scalable application environment. The ideas are universal, the technology specific.

Thank you.