Elasticsearch Storage Utilization Alert Triage
Overview
This document provides:
- Hypotheses for what the problem might be.
- Steps to validate the hypothesis.
- Mitigations to be considered while debugging.
Hypotheses
- The cluster is storing a large number of documents.
- The cluster is still storing large amounts of deleted documents queued for garbage collection.
- The cluster has not been allocated a sufficient amount of storage.
Validation
Metrics
- Log into the AWS Console
- Navigate to OpenSearch Dashboard
- Select the OpenSearch cluster that is relevant to your alert
- Select the "Cluster Health" tab
Removal of storage
Deleted documents are not removed from disk immediately. They are first marked as deleted and accumulate until garbage collection removes the data, so they continue to consume storage in the meantime. You can check how many are waiting with the metric "Deleted Documents".
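To see which indices hold the most deleted documents, the cluster's `_cat/indices` API can be queried with `format=json`. The following is a minimal sketch, assuming you have already fetched and parsed that response. The function name and the sample data are hypothetical, and the ratio is only a rough indicator of reclaimable space, not an exact byte count.

```python
def rank_by_deleted_ratio(rows):
    """Rank indices by the share of their documents that are marked deleted.

    `rows` is the parsed JSON from `GET _cat/indices?format=json`, where
    every value arrives as a string (or None, for example on closed indices).
    """
    ranked = []
    for row in rows:
        live = int(row.get("docs.count") or 0)
        deleted = int(row.get("docs.deleted") or 0)
        total = live + deleted
        # Guard against empty indices to avoid dividing by zero.
        ratio = deleted / total if total else 0.0
        ranked.append((row["index"], deleted, round(ratio, 3)))
    # Highest ratio first: proportionally the most deleted documents.
    return sorted(ranked, key=lambda item: item[2], reverse=True)


# Hypothetical sample data, shaped like the _cat/indices response.
sample = [
    {"index": "logs-2024", "docs.count": "900", "docs.deleted": "100"},
    {"index": "content", "docs.count": "500", "docs.deleted": "500"},
    {"index": "empty", "docs.count": "0", "docs.deleted": "0"},
]

for name, deleted, ratio in rank_by_deleted_ratio(sample):
    print(f"{name}: {deleted} deleted ({ratio:.1%})")
```

An index near the top of this list is a reasonable first candidate when discussing garbage collection with the Skpr Platform Team.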
Storage of Documents and indices
The metric "Indexing Rate" shows how quickly documents are being written to the cluster. A sustained high rate means the storage used for documents and indices is growing, which gives an overall impression of storage utilization. It can also be a sign that not enough storage is available.
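Per-node disk usage can be cross-checked against the dashboard with the `_cat/allocation` API. The sketch below assumes the response was requested with `format=json&bytes=b`, so that sizes arrive as plain byte counts. The function name, the 80 percent threshold, and the sample data are assumptions for illustration, not values taken from the alert configuration.

```python
def nodes_over_threshold(rows, threshold_percent=80.0):
    """Return (node, percent_used) for nodes at or above the threshold.

    `rows` is the parsed JSON from
    `GET _cat/allocation?format=json&bytes=b`.
    """
    flagged = []
    for row in rows:
        used = int(row.get("disk.used") or 0)
        total = int(row.get("disk.total") or 0)
        if total <= 0:
            # Rows without disk figures (such as UNASSIGNED) are skipped.
            continue
        percent = 100.0 * used / total
        if percent >= threshold_percent:
            flagged.append((row["node"], round(percent, 1)))
    # Fullest node first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data, shaped like the _cat/allocation response.
sample_allocation = [
    {"node": "node-1", "disk.used": "850", "disk.total": "1000"},
    {"node": "node-2", "disk.used": "400", "disk.total": "1000"},
    {"node": "UNASSIGNED", "disk.used": None, "disk.total": None},
]

for node, percent in nodes_over_threshold(sample_allocation):
    print(f"{node} is {percent}% full")
```

If every node is flagged, the cluster is more likely under-provisioned. If only one node is flagged, uneven shard allocation may be the cause.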
Logs
The bottom section of the OpenSearch cluster instance dashboard has a Logs tab, which links to the following log groups in CloudWatch Logs Insights. These logs can also help to identify malformed or malicious requests.
- Search slow logs
- Index slow logs
- Error logs
- Audit logs
Mitigation
- Can the Skpr Platform Team block any undesired, malicious or malformed traffic?
- Can the Development Team disable a recently released feature generating these workloads?
- Can the Development Team delete an unnecessary index?
- Can the Skpr Platform Team adjust the garbage collection policy to run more frequently to meet the application's needs?
- Can the Skpr Platform Team scale the storage of the cluster?
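Two of the mitigations above map onto standard REST calls: deleting an index (`DELETE /<index>`) and reclaiming space held by deleted documents (`POST /<index>/_forcemerge?only_expunge_deletes=true`). The sketch below only builds the requests so they can be reviewed before anything is sent. The endpoint and index names are hypothetical, authentication is not handled, and whether a force merge is appropriate for a given cluster is an assumption to confirm with the Skpr Platform Team.

```python
from urllib.parse import quote
from urllib.request import Request


def build_cleanup_requests(endpoint, stale_index, merge_index):
    """Build, but do not send, the two cleanup requests.

    `stale_index` is an index the Development Team has agreed to delete.
    `merge_index` is an index holding many deleted documents.
    """
    base = endpoint.rstrip("/")
    # Deleting an index is irreversible, so review the URL before sending.
    delete_req = Request(f"{base}/{quote(stale_index, safe='')}", method="DELETE")
    # only_expunge_deletes limits the merge to segments with deleted documents.
    merge_req = Request(
        f"{base}/{quote(merge_index, safe='')}/_forcemerge?only_expunge_deletes=true",
        method="POST",
    )
    return [delete_req, merge_req]


# Hypothetical endpoint and index names, for illustration only.
requests_to_review = build_cleanup_requests(
    "https://search.example.internal/", "old-index", "content"
)
for req in requests_to_review:
    print(req.get_method(), req.full_url)
```

Keeping the build step separate from the send step lets a second person check the target index before a destructive call is made.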