ElasticSearch Storage Utilization Alert Triage

Overview

This document provides:

Hypotheses for what the problem might be.
Steps to validate the hypothesis.
Mitigations to be considered while debugging.

Hypothesis

The cluster is storing a large number of documents.
The cluster is still storing large amounts of deleted documents queued for garbage collection.
The cluster has not been allocated a sufficient amount of storage.

Validation

Metrics

Log into the AWS Console
Navigate to OpenSearch Dashboard
Select the OpenSearch cluster that is relevant to your alert
Select the "Cluster Health" tab

Removal of storage

Deleted documents are not removed from disk immediately. They are first marked as deleted and accumulate until the garbage collection policy removes the data, so they continue to consume storage in the meantime. You can check how many are waiting with the metric "Deleted Documents".
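The same check can be made per index rather than for the whole cluster. The sketch below is a minimal example, assuming you have fetched the output of `GET _cat/indices?format=json&bytes=b` from the cluster. The sample payload and index names are illustrative, and the helper names `deleted_ratio` and `rank_by_deleted` are hypothetical.

```python
import json

# Illustrative payload in the shape returned by
# GET _cat/indices?format=json&bytes=b (all values arrive as strings).
# The index names and numbers are made up for this example.
SAMPLE = json.loads("""
[
  {"index": "content", "docs.count": "9000", "docs.deleted": "1000", "store.size": "5000000"},
  {"index": "logs", "docs.count": "2000", "docs.deleted": "6000", "store.size": "8000000"},
  {"index": "empty", "docs.count": "0", "docs.deleted": "0", "store.size": "225"}
]
""")


def deleted_ratio(row):
    """Fraction of an index's documents that are marked deleted."""
    # Closed indices can report null counts, so fall back to 0.
    live = int(row["docs.count"] or 0)
    deleted = int(row["docs.deleted"] or 0)
    total = live + deleted
    return deleted / total if total else 0.0


def rank_by_deleted(rows):
    """Return (index, ratio) pairs, highest deleted ratio first."""
    ranked = [(row["index"], round(deleted_ratio(row), 2)) for row in rows]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in rank_by_deleted(SAMPLE):
        print(f"{name}: {ratio:.0%} deleted")
```

An index near the top of this list with a large `store.size` is a likely candidate for the "deleted documents queued for garbage collection" hypothesis.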

Storage of Documents and indices

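The metric "Indexing Rate" shows how quickly documents are being written to the cluster, which indicates how fast storage is being consumed by documents and indices. Read alongside the free storage space metrics, it gives an overall impression of storage utilization. A sustained high rate with little free space remaining can also be a sign that not enough storage is available.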
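Storage utilization can also be read per node from the cluster itself. The following is a minimal sketch, assuming you have the output of `GET _cat/allocation?format=json&bytes=b`. The node names, sample figures, and the 75% threshold are assumptions for illustration and are not taken from the alert definition.

```python
# Illustrative rows in the shape returned by
# GET _cat/allocation?format=json&bytes=b (values arrive as strings).
SAMPLE_ALLOCATION = [
    {"node": "node-a", "disk.used": "80000000000", "disk.total": "100000000000"},
    {"node": "node-b", "disk.used": "45000000000", "disk.total": "100000000000"},
    # Unassigned shards are reported as a row without disk figures.
    {"node": "UNASSIGNED", "disk.used": None, "disk.total": None},
]

# Hypothetical threshold; use the value from your own alert.
THRESHOLD_PERCENT = 75.0


def disk_usage(rows):
    """Map node name to percentage of disk used, skipping rows without disk data."""
    usage = {}
    for row in rows:
        if not row.get("disk.total"):
            continue
        used = int(row["disk.used"])
        total = int(row["disk.total"])
        usage[row["node"]] = round(100 * used / total, 1)
    return usage


def nodes_over(rows, threshold=THRESHOLD_PERCENT):
    """Return the names of nodes at or above the threshold, sorted by name."""
    return sorted(node for node, pct in disk_usage(rows).items() if pct >= threshold)


if __name__ == "__main__":
    print(disk_usage(SAMPLE_ALLOCATION))
    print(nodes_over(SAMPLE_ALLOCATION))
```

If every node is close to the threshold, the cluster is more likely under-provisioned. If only one node is, shard placement may be uneven.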

Logs

The Logs tab in the bottom section of the OpenSearch cluster instance dashboard links to the following log groups in CloudWatch Logs Insights. These can also help to identify malformed or malicious requests.

Search slow logs
Index slow logs
Error logs
Audit logs
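Once slow log lines have been exported from one of these log groups, they can be summarised per index to see which workload dominates. This is a rough sketch, assuming the lines contain a fragment of the form `[index][shard] took[...], took_millis[...]`. The sample lines are illustrative, and the pattern may need adjusting to match the format your cluster version emits.

```python
import re
from collections import Counter

# Assumed fragment of a slow log line:
#   [index-name][shard-number] took[6.9s], took_millis[6900]
SLOW_LOG = re.compile(
    r"\[(?P<index>[^\]\[]+)\]\[(?P<shard>\d+)\]\s+"
    r"took\[[^\]]*\],\s+took_millis\[(?P<millis>\d+)\]"
)

# Illustrative lines only; real entries carry more fields.
SAMPLE_LINES = [
    "[2024-01-01T00:00:01,000][WARN ][index.search.slowlog.query] [node-a] "
    "[content][0] took[6.9s], took_millis[6900], total_shards[5], source[{}]",
    "[2024-01-01T00:00:02,000][WARN ][index.search.slowlog.query] [node-a] "
    "[logs][1] took[3s], took_millis[3000], total_shards[5], source[{}]",
    "[2024-01-01T00:00:03,000][WARN ][index.search.slowlog.query] [node-b] "
    "[content][2] took[1.2s], took_millis[1200], total_shards[5], source[{}]",
    "a line that is not a slow log entry",
]


def time_by_index(lines):
    """Sum took_millis per index, returning (index, millis) pairs, largest first."""
    totals = Counter()
    for line in lines:
        match = SLOW_LOG.search(line)
        if match:
            totals[match["index"]] += int(match["millis"])
    return totals.most_common()


if __name__ == "__main__":
    for index, millis in time_by_index(SAMPLE_LINES):
        print(f"{index}: {millis} ms in slow operations")
```

Lines that do not match the pattern are skipped rather than treated as errors, so the function can be run over a mixed export.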

Mitigation

Can the Skpr Platform Team block any undesired, malicious or malformed traffic?
Can the Development Team disable a recently released feature generating these workloads?
Can the Development Team delete an unnecessary index?
Can the Skpr Platform Team adjust the garbage collection policy to be more frequent to meet the application needs?
Can the Skpr Platform Team scale the storage of the cluster?
