ElastiCache (Redis) CPU Utilization Alert Triage

Overview

This document provides:

  • Hypotheses about what the problem might be.
  • Steps to validate each hypothesis.
  • Mitigations to consider while debugging.

Hypotheses

  • The Redis instance is competing for system resources.
  • The Redis instance is running a workload made up of complex operations.
  • The Redis instance is receiving a high volume of traffic.

Validation

Finding Redis Metrics

Several of the sections below refer to metrics that describe specific behaviour. To view these metrics, follow the steps below:

  • Log into the AWS Console
  • Navigate to Redis OSS Caches
  • Select the cluster that is relevant to your alert
  • Select the Metrics tab
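The same metrics can also be read from CloudWatch outside the console. The following is a minimal sketch, assuming boto3 is installed, AWS credentials are configured, and the cluster is identified by the CacheClusterId dimension. The helper names and the cluster ID in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def metric_query(cluster_id, metric_name, minutes=60, period=60,
                 statistic="Average", now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": [statistic],
    }


def fetch_datapoints(cluster_id, metric_name, **kwargs):
    """Fetch datapoints, oldest first. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so metric_query works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **metric_query(cluster_id, metric_name, **kwargs)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For example, `fetch_datapoints("my-cluster-001", "CPUUtilization")` would return the last hour of average CPU utilization. For count metrics such as Evictions, passing `statistic="Sum"` is usually more informative than an average.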

Finding how many evictions there are

Evictions of keys that haven't yet expired indicate that the instance is under memory pressure. When Redis reaches its memory limit, it evicts keys to make room for new data, and this extra work can increase the cluster's CPU utilization. The metric is named Evictions and can be found on the second page of the Metrics tab on the Redis dashboard.
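To check whether evictions line up with the CPU alert, the two series can be compared period by period. The sketch below is illustrative only: the function name, the sample data, and the 80% threshold are assumptions, not values taken from the alert configuration.

```python
def eviction_pressure(evictions, cpu, cpu_threshold=80.0):
    """Return timestamps where keys were evicted while CPU was at or
    above the threshold.

    evictions and cpu each map a timestamp to the metric value for
    that period.
    """
    return sorted(
        ts
        for ts, evicted in evictions.items()
        if evicted > 0 and cpu.get(ts, 0.0) >= cpu_threshold
    )


# Hypothetical sample data, keyed by minute.
evictions = {"12:00": 0, "12:01": 150, "12:02": 420, "12:03": 0}
cpu = {"12:00": 35.0, "12:01": 82.5, "12:02": 91.0, "12:03": 88.0}
print(eviction_pressure(evictions, cpu))  # ['12:01', '12:02']
```

If most high-CPU periods also show evictions, memory pressure is a likely contributor. High CPU without evictions, as at 12:03 in the sample, points towards one of the other hypotheses.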

Identifying the type of loads

Redis metrics are granular enough to show which type of operation is being performed, simple or complex. This lets you correlate each category of operation with the increase in CPU utilization.

Complex operations can also cause the application to receive other errors when a command can't be completed within the configured timeframes.

The table below will help you to categorize the workload:

| Graph Name                     | Metric Name        | Category     |
|--------------------------------|--------------------|--------------|
| Get Type Command Count         | GetTypeCmds        | Simple       |
| String Based Command Count     | StringBasedCmds    | Simple       |
| Key Based Command Count        | KeyBasedCmds       | Simple       |
| Set Type Command Count         | SetTypeCmds        | Complex      |
| Hash Based Command Count       | HashBasedCmds      | Complex      |
| List Based Command Count       | ListBasedCmds      | Complex      |
| Set Based Command Count        | SetBasedCmds       | Complex      |
| Stream Based Command Count     | StreamBasedCmds    | Very Complex |
| Sorted Set Based Command Count | SortedSetBasedCmds | Very Complex |
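The categorization above can be applied to command counts collected over the alert window. This is a rough sketch: the mapping mirrors the table, while the function name and sample counts are hypothetical.

```python
# Metric name -> category, mirroring the table above.
CATEGORY = {
    "GetTypeCmds": "Simple",
    "StringBasedCmds": "Simple",
    "KeyBasedCmds": "Simple",
    "SetTypeCmds": "Complex",
    "HashBasedCmds": "Complex",
    "ListBasedCmds": "Complex",
    "SetBasedCmds": "Complex",
    "StreamBasedCmds": "Very Complex",
    "SortedSetBasedCmds": "Very Complex",
}


def load_by_category(command_counts):
    """Sum command counts per category; unlisted metrics go to 'Unknown'."""
    totals = {}
    for metric, count in command_counts.items():
        category = CATEGORY.get(metric, "Unknown")
        totals[category] = totals.get(category, 0) + count
    return totals


# Hypothetical counts summed over the alert window.
sample = {"StringBasedCmds": 600, "HashBasedCmds": 300, "SortedSetBasedCmds": 100}
print(load_by_category(sample))
```

GetTypeCmds and SetTypeCmds are aggregate read and write counts, so they can overlap with the per-data-type metrics. Treat the totals as a rough signal of where the load sits rather than an exact breakdown.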

Mitigation

  • Can the Skpr Platform Team add a read replica to the ElastiCache cluster?
  • Can the Skpr Platform Team scale the ElastiCache cluster horizontally or vertically to accommodate the load?