ElastiCache (Redis) CPU Utilization Alert Triage

Overview

This document provides:

  • Hypothesis for what the problem might be.
  • Steps to validate the hypothesis.
  • Mitigations to be considered while debugging.

Hypothesis

  • The Redis instance is competing for system resources.
  • The Redis instance is running a load with complex operations.
  • The Redis instance is experiencing a large amount of traffic.

Validation

Finding Redis Metrics

Many of the sections below refer to metrics that describe specific behaviour. To view these metrics, follow the directions below:

  • Log into the AWS Console
  • Navigate to Redis OSS caches
  • Select the cluster that is relevant to your alert
  • Select the metrics tab
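
The same metrics can also be pulled from CloudWatch programmatically. The following is a minimal sketch, assuming the boto3 library is installed, AWS credentials are configured, and the cluster reports under the AWS/ElastiCache namespace with a CacheClusterId dimension. The helper names build_metric_query and fetch_metric, and the cluster ID in the usage comment, are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(cluster_id, metric_name, statistic="Average",
                       hours=3, period_seconds=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def fetch_metric(cluster_id, metric_name, **options):
    """Fetch datapoints for one metric, oldest first (assumes boto3 and credentials)."""
    import boto3  # imported lazily so the query builder works without boto3

    client = boto3.client("cloudwatch")
    query = build_metric_query(cluster_id, metric_name, **options)
    response = client.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Example (hypothetical cluster ID):
# cpu_points = fetch_metric("my-cache-001", "CPUUtilization")
# eviction_points = fetch_metric("my-cache-001", "Evictions", statistic="Sum")
```

Keeping the query builder separate from the network call makes it easy to inspect the request before anything is sent to AWS.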

Finding how many evictions there are

Evictions of keys that haven't yet expired indicate that the instance is under memory pressure. The instance spends extra CPU working within that limit to accommodate the load, so expect CPU utilization on the cluster to rise while evictions are occurring. The metric is named Evictions and can be found on the second page of the Metrics tab on the Redis dashboard.
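
To check whether evictions line up with the CPU alert, the two series can be compared period by period. This is a minimal sketch, assuming both metrics have already been exported as dictionaries keyed by the same period labels. The function name eviction_pressure_windows and the 80% threshold are illustrative assumptions, not values taken from the alert configuration.

```python
def eviction_pressure_windows(evictions, cpu, cpu_threshold=80.0):
    """Return the periods where keys were evicted while CPU was at or above the threshold.

    evictions: dict mapping a period label to the eviction count in that period.
    cpu: dict mapping the same period labels to CPU utilization (percent).
    cpu_threshold: assumed cut-off for "high" CPU; adjust to match your alert.
    """
    return [
        period
        for period in sorted(evictions)
        if evictions[period] > 0 and cpu.get(period, 0.0) >= cpu_threshold
    ]


# Sample data for illustration only.
sample_evictions = {"12:00": 0, "12:05": 150, "12:10": 30}
sample_cpu = {"12:00": 85.0, "12:05": 92.0, "12:10": 40.0}
overlap = eviction_pressure_windows(sample_evictions, sample_cpu)
```

If the returned list covers most of the alert window, memory pressure is a plausible contributor to the CPU increase. An empty list suggests looking at the other hypotheses first.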

Identifying the type of loads

Redis metrics are granular enough to show which type of operation is being performed: simple or complex. This lets you correlate each category of operation with the increase in CPU utilization.

Complex operations can also cause other errors in the application when a command can't complete within its configured timeout.

The table below will help you to categorize the workload:

Graph Name                        Metric Name          Category
Get Type Command Count            GetTypeCmds          Simple
String Based Command Count        StringBasedCmds      Simple
Key Based Command Count           KeyBasedCmds         Simple
Set Type Command Count            SetTypeCmds          Complex
Hash Based Command Count          HashBasedCmds        Complex
List Based Command Count          ListBasedCmds        Complex
Set Based Command Count           SetBasedCmds         Complex
Stream Based Command Count        StreamBasedCmds      Very Complex
Sorted Set Based Command Count    SortedSetBasedCmds   Very Complex
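
The correlation described above can be sketched in code. The example below is a rough illustration, assuming the CPU series and each command-count series have been exported as equal-length lists covering the same periods. The category mapping comes from the table above. The names pearson and correlation_by_category, and the use of Pearson correlation as the measure, are assumptions made for this sketch.

```python
from math import sqrt

# Metric-to-category mapping, taken from the table above.
CATEGORY_BY_METRIC = {
    "GetTypeCmds": "Simple",
    "StringBasedCmds": "Simple",
    "KeyBasedCmds": "Simple",
    "SetTypeCmds": "Complex",
    "HashBasedCmds": "Complex",
    "ListBasedCmds": "Complex",
    "SetBasedCmds": "Complex",
    "StreamBasedCmds": "Very Complex",
    "SortedSetBasedCmds": "Very Complex",
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length and at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_by_category(cpu, command_counts):
    """Sum command counts per category, then correlate each total with CPU."""
    totals = {}
    for metric, series in command_counts.items():
        if len(series) != len(cpu):
            raise ValueError(f"{metric} does not cover the same periods as cpu")
        running = totals.setdefault(CATEGORY_BY_METRIC[metric], [0.0] * len(cpu))
        for i, count in enumerate(series):
            running[i] += count
    return {category: pearson(cpu, total) for category, total in totals.items()}


# Sample data for illustration only.
sample_cpu = [10.0, 20.0, 30.0, 40.0]
sample_counts = {
    "GetTypeCmds": [100, 100, 100, 100],
    "SortedSetBasedCmds": [1, 2, 3, 4],
}
scores = correlation_by_category(sample_cpu, sample_counts)
```

A category whose score is close to 1 rose and fell with CPU during the window, which makes it the first place to look. Correlation alone does not prove that the category caused the increase.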

Mitigation

  • Can the Skpr Platform Team add a read replica to the ElastiCache cluster?
  • Can the Skpr Platform Team scale the ElastiCache cluster horizontally or vertically to accommodate the load?