ElasticSearch CPU Utilization Alert Triage
Overview
This document provides:
- Hypotheses for what the problem might be.
- Steps to validate the hypothesis.
- Mitigations to be considered while debugging.
Hypotheses
- The application is experiencing a high volume of traffic.
- The load is being generated by a large query.
- The ElasticSearch instance is performing large indexing operations.
- The ElasticSearch instance is overloaded.
Validation
Metrics
- Log into the AWS Console
- Navigate to OpenSearch Dashboard
- Select the OpenSearch cluster that is relevant to your alert
- Select the "Cluster Health" tab
Checking the queries on the instance
The following metrics are useful for determining operation type, frequency and complexity:
- Indexing Data Rate: Measures how quickly data is being indexed. When the application is receiving a large number of requests this should ideally be high, indicating that requests are being processed efficiently; a low rate under load means the service is running in a degraded state.
- Search Rate: Measures the number of read-only search requests being served. A high rate is healthy, as it indicates the cluster is handling search operations efficiently.
- Write Thread Pool: Tracks write operations (adding, updating and deleting stored documents), showing the active thread count, requests waiting in the queue and requests rejected when the system is overloaded and cannot process incoming work.
- Index Thread Pool: Similar to the Write Thread Pool, but covers indexing operations specifically.
- Search Thread Pool: Tracks read-only search operations, showing the relationship between active threads, requests still queued for processing and requests rejected when the system is overloaded.
- Merge Thread Pool: The same breakdown as the Write Thread Pool (active thread count, queued requests and rejections under load), but for segment merge operations.
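These rate and thread pool metrics are also published to CloudWatch, so they can be pulled programmatically instead of through the dashboard. Below is a minimal Python (boto3) sketch; the namespace and metric names follow Amazon OpenSearch Service's documented CloudWatch metrics, but the region, domain name and account ID are placeholders, and you should verify the metric names against those visible on your own domain.

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")  # assumed region

# Placeholders: replace with your OpenSearch domain name and AWS account ID.
DIMENSIONS = [
    {"Name": "DomainName", "Value": "my-opensearch-domain"},
    {"Name": "ClientId", "Value": "123456789012"},
]

# Metric names based on Amazon OpenSearch Service's CloudWatch metrics;
# confirm them against the metrics listed for your domain.
METRICS = [
    "IndexingRate",
    "SearchRate",
    "ThreadpoolWriteQueue",
    "ThreadpoolWriteRejected",
    "ThreadpoolSearchQueue",
    "ThreadpoolSearchRejected",
]

now = datetime.datetime.now(datetime.timezone.utc)

for metric in METRICS:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",  # OpenSearch Service domain metrics are published under AWS/ES
        MetricName=metric,
        Dimensions=DIMENSIONS,
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,  # 5-minute buckets
        Statistics=["Maximum"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    print(metric, [p["Maximum"] for p in points])
```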
Logs
There is a Logs tab in the bottom section of the OpenSearch cluster dashboard that links to the following log groups in CloudWatch Logs Insights. These logs can also help identify malformed or malicious requests.
- Search slow logs
- Index slow logs
- Error logs
- Audit logs
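The same log groups can be queried programmatically with CloudWatch Logs Insights. A minimal sketch is below; the log group name is a placeholder (the actual name depends on how slow logs were configured for the domain), and the query simply lists the most recent slow log entries.

```python
import time

import boto3

logs = boto3.client("logs", region_name="ap-southeast-2")  # assumed region

# Placeholder: use the search slow log group configured for your domain.
LOG_GROUP = "/aws/OpenSearchService/domains/my-opensearch-domain/search-slow-logs"

# List the most recent slow log entries from the last hour.
query = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 20",
)

# Logs Insights queries run asynchronously; poll until the query finishes.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})
```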
Checking if the cluster is overloaded
When the CPU Utilization and JVM Memory Pressure metrics in the Cluster Health dashboard are consistently high, the cluster is running up against its resource limits.
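These two metrics can also be checked programmatically when you want to confirm the pressure has been sustained rather than a brief spike. A minimal sketch, using the same placeholder domain name and account ID as earlier:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")  # assumed region

now = datetime.datetime.now(datetime.timezone.utc)

for metric in ("CPUUtilization", "JVMMemoryPressure"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=[
            {"Name": "DomainName", "Value": "my-opensearch-domain"},  # placeholder
            {"Name": "ClientId", "Value": "123456789012"},            # placeholder account ID
        ],
        StartTime=now - datetime.timedelta(hours=3),
        EndTime=now,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    # Sustained high averages (not just isolated maxima) indicate the cluster
    # is running up against its resource limits.
    print(metric, [(round(p["Average"]), round(p["Maximum"])) for p in points])
```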
Mitigation
- Can the Skpr Platform Team block any undesired, malicious or malformed requests?
- Can the Development Team disable a recently released feature generating these workloads?
- Can the Development Team review field mappings to reduce unnecessary workloads?
- Can the Development Team optimize the application's workload, for example with queueing or caching?
- Can the Skpr Platform Team scale the OpenSearch cluster?
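For the last mitigation, scaling an Amazon OpenSearch Service domain is a configuration change on the domain itself. The sketch below uses boto3's update_domain_config call; the region, domain name, instance type and node count are placeholders chosen for illustration, not recommendations.

```python
import boto3

opensearch = boto3.client("opensearch", region_name="ap-southeast-2")  # assumed region

# Check the current cluster configuration before changing it.
current = opensearch.describe_domain(DomainName="my-opensearch-domain")  # placeholder domain
print(current["DomainStatus"]["ClusterConfig"])

# Apply a new instance type and/or data node count (placeholder values).
response = opensearch.update_domain_config(
    DomainName="my-opensearch-domain",       # placeholder domain name
    ClusterConfig={
        "InstanceType": "r6g.large.search",  # placeholder instance type
        "InstanceCount": 4,                  # placeholder data node count
    },
)
print(response["DomainConfig"]["ClusterConfig"]["Status"]["State"])
```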