ElasticSearch Memory Utilization Alert Triage
Overview
This document provides:
- Hypotheses for what the problem might be.
- Steps to validate each hypothesis.
- Mitigations to be considered while debugging.
Hypotheses
- A large query is generating the load.
- The instance does not have sufficient memory.
- The instance is not running garbage collection frequently enough.
Validation
Metrics
- Log in to the AWS Console
- Navigate to OpenSearch Dashboard
- Select the OpenSearch cluster that is relevant to your alert
- Select the "Cluster Health" tab
Larger queries
Compare the "Search Latency" and "Indexing Latency" metrics in the "Cluster Health" tab against memory utilization over the same time window. If latency spikes coincide with the memory increase, the queries being performed are likely contributing to it. This presents an opportunity to optimize the application or its data structures.
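As a rough aid to the comparison described here, the following sketch computes a Pearson correlation coefficient between two metric series. It assumes you have exported latency and memory samples for the same timestamps; the sample values and variable names are hypothetical and do not come from a real cluster.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps (assumption, not real data).
search_latency_ms = [40, 42, 45, 120, 310, 290]
memory_percent = [61, 62, 63, 78, 95, 93]

r = pearson(search_latency_ms, memory_percent)
print(f"latency/memory correlation: {r:.2f}")
```

A value close to 1 suggests latency and memory rise together. A correlation alone does not prove that queries cause the memory increase, so treat it as a prompt to inspect the slow logs.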
Does the cluster have enough memory?
This can be identified using the "Maximum memory Utilization" metric in the "Cluster Health" tab. If the metric plateaus, effectively flat-lining at 100% utilization, memory has become the bottleneck.
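The flat-line check can be sketched as a small helper that finds the longest run of consecutive samples at or near the ceiling. The function name, the 1% tolerance and the sample values are illustrative assumptions; tune them to the sampling period of your exported metric.

```python
def longest_saturated_run(samples, ceiling=100.0, tolerance=1.0):
    """Return the length of the longest run of consecutive samples
    that sit within `tolerance` of `ceiling`."""
    longest = current = 0
    for sample in samples:
        if sample >= ceiling - tolerance:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical memory utilization samples, in percent (assumption).
memory_percent = [70, 85, 99.5, 100, 100, 92]
run = longest_saturated_run(memory_percent)
print(f"longest run at the ceiling: {run} samples")
```

A long run at the ceiling is consistent with the cluster being out of memory headroom, whereas brief peaks that drop back quickly are less conclusive.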
Is garbage collection adequate?
The "Old Collection" and "Old Collection Time" metrics show how often old-generation garbage collection runs and how much time the cluster spends performing it. Frequent or long-running old collections alongside persistently high memory utilization suggest that garbage collection is not reclaiming enough memory. If this behaviour does not meet the needs of the application, the garbage collection configuration can be adjusted.
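To make the two garbage collection metrics easier to read together, the sketch below converts them into per-interval figures. It assumes both metrics are exported as cumulative counters sampled at the same timestamps; if your export already reports per-period values, the subtraction step is unnecessary. The function name and sample values are hypothetical.

```python
def gc_deltas(counts, times_ms):
    """For each sampling interval, return a tuple of
    (collections run, time spent in ms, average ms per collection)."""
    if len(counts) != len(times_ms):
        raise ValueError("both series must have the same number of samples")
    deltas = []
    for i in range(1, len(counts)):
        runs = counts[i] - counts[i - 1]
        spent = times_ms[i] - times_ms[i - 1]
        average = spent / runs if runs > 0 else 0.0
        deltas.append((runs, spent, average))
    return deltas


# Hypothetical cumulative samples (assumption, not real cluster data).
old_collection_count = [10, 10, 12, 18]
old_collection_time_ms = [500, 500, 900, 3300]

for runs, spent, average in gc_deltas(old_collection_count, old_collection_time_ms):
    print(f"runs={runs} spent={spent}ms avg={average:.0f}ms")
```

Rising counts together with rising average time per collection would suggest the collector is working harder for less benefit.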
Logs
The bottom section of the OpenSearch cluster instance dashboard has a logs tab that links to the following log groups in CloudWatch Logs Insights. These logs can also help to identify malformed or malicious requests.
- Search slow logs
- Index slow logs
- Error logs
- Audit logs
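When reviewing exported search slow logs, a short script can surface the slowest entries. This sketch assumes each slow log line contains a `took_millis[...]` field, as in the default slow log format; verify this against your own log output. The sample lines, index names and helper name are hypothetical.

```python
import re

# Matches the duration field assumed to be present in each slow log line.
TOOK_MILLIS = re.compile(r"took_millis\[(\d+)\]")


def slowest_entries(lines, top=3):
    """Return up to `top` (took_millis, line) pairs, slowest first.
    Lines without a took_millis field are skipped."""
    entries = []
    for line in lines:
        match = TOOK_MILLIS.search(line)
        if match:
            entries.append((int(match.group(1)), line))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries[:top]


# Hypothetical, shortened slow log lines (assumption, not real log output).
sample_lines = [
    "[index.search.slowlog.query] [products][0] took[1.2s], took_millis[1200]",
    "[index.search.slowlog.query] [products][1] took[8.4s], took_millis[8400]",
    "a line without a duration field",
]

for millis, line in slowest_entries(sample_lines):
    print(millis, line)
```

The slowest entries are a reasonable starting point when looking for the query behind a memory spike.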
Mitigation
- Can the Skpr Platform Team block any undesired, malicious or malformed traffic?
- Can the Development Team disable a recently released feature generating these workloads?
- Can the Skpr Platform Team scale the OpenSearch cluster's memory?
- Can the Skpr Platform Team adjust the garbage collection policies so that collection runs more frequently?