ElasticSearch CPU Utilization Alert Triage

Overview

This document provides:

Hypotheses for what the problem might be.
Steps to validate the hypothesis.
Mitigations to be considered while debugging.

Hypothesis

The application is experiencing a high amount of traffic.
The load is being generated by a large query.
The ElasticSearch instance is performing large indexing operations.
The ElasticSearch instance is overloaded.

Validation

Metrics

Log into the AWS Console
Navigate to OpenSearch Dashboard
Select the OpenSearch cluster that is relevant to your alert
Select the "Cluster Health" tab
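
The same CPU metric can also be read outside the console. The following is a minimal sketch, assuming the cluster is an Amazon OpenSearch Service domain that publishes to the AWS/ES CloudWatch namespace, and that boto3 and AWS credentials are available. The function names, domain name, account ID and region are hypothetical placeholders, not part of this runbook.

```python
from datetime import datetime, timedelta, timezone


def cpu_metric_request(domain_name, account_id, minutes=60, period_seconds=300):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Assumes the domain reports CPUUtilization under the AWS/ES namespace,
    keyed by the DomainName and ClientId (AWS account ID) dimensions.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "Namespace": "AWS/ES",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_cpu_datapoints(domain_name, account_id, region):
    """Fetch CPU datapoints, oldest first. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(
        **cpu_metric_request(domain_name, account_id)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For example, fetch_cpu_datapoints("example-domain", "123456789012", "ap-southeast-2") would return the last hour of five-minute averages and maximums for a domain with that (placeholder) name.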

Checking the queries on the instance

The following metrics are useful for determining operation type, frequency and complexity:

Indexing Data Rate: The rate at which data is being indexed. Ideally this rises with application traffic, which shows that requests are being processed as they arrive. A low rate while traffic is high suggests the service is running in a degraded state.
Search Rate: The number of read-only (search) requests being made. A high rate shows the cluster is keeping up with search traffic. A spike that coincides with the CPU alert suggests search traffic is the source of the load.
Write Thread Pool: Write operations being processed, which include adding, updating and deleting stored documents. The metric also shows requests waiting in the pool's queue, and requests rejected because the system is overloaded and cannot process incoming requests.
Index Thread Pool: Similar to the Write Thread Pool, but focused specifically on indexing operations.
Search Thread Pool: Read-only search operations. This metric shows active threads, requests waiting in the queue to be processed, and requests rejected because the system is overloaded.
Merge Thread Pool: Similar to the Write Thread Pool, but focused on merge operations. It shows the active thread count, queued requests, and requests rejected because the system is overloaded.
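
The thread pool figures described above can also be summarized from the cluster itself. The following is a rough sketch, assuming you can retrieve the JSON body of the node stats API (GET _nodes/stats/thread_pool) and load it into a Python dict. The function names and the sample data are hypothetical. Note that the rejected counter in that API is cumulative since node start, so comparing two samples taken a few minutes apart is more telling than a single reading.

```python
def total_by_pool(nodes_stats):
    """Sum active threads, queue depth and rejections per pool across nodes."""
    totals = {}
    for node in nodes_stats.get("nodes", {}).values():
        for pool, stats in node.get("thread_pool", {}).items():
            entry = totals.setdefault(pool, {"active": 0, "queue": 0, "rejected": 0})
            for key in entry:
                entry[key] += stats.get(key, 0)
    return totals


def pools_under_pressure(nodes_stats):
    """Return names of pools with queued or rejected work, worst first."""
    totals = total_by_pool(nodes_stats)
    busy = [(name, t) for name, t in totals.items() if t["rejected"] or t["queue"]]
    # Rank by rejections first, then by queue depth.
    busy.sort(key=lambda item: (item[1]["rejected"], item[1]["queue"]), reverse=True)
    return [name for name, _ in busy]


# Hypothetical sample in the shape of a node stats response.
sample = {
    "nodes": {
        "node-a": {"thread_pool": {
            "search": {"active": 4, "queue": 12, "rejected": 3},
            "write": {"active": 1, "queue": 0, "rejected": 0},
        }},
        "node-b": {"thread_pool": {
            "search": {"active": 2, "queue": 5, "rejected": 0},
            "write": {"active": 0, "queue": 2, "rejected": 0},
        }},
    }
}
print(pools_under_pressure(sample))
```

A pool that ranks first here and keeps accumulating rejections is a reasonable place to start when deciding whether the load is search-driven or write-driven.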

Logs

The Logs tab in the bottom section of the OpenSearch cluster dashboard links to the following log groups in CloudWatch Logs Insights. These logs can also help to identify malformed or malicious requests.

Search slow logs
Index slow logs
Error logs
Audit logs
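
Once slow log lines have been exported from one of these log groups, the slowest entries can be ranked with a few lines of code. The following is a minimal sketch, assuming each slow log message contains a took_millis[N] field as in the standard plain-text slow log format. The function name and the sample lines are hypothetical, and the field layout may differ between versions.

```python
import re

# Matches the elapsed time field, for example "took_millis[1200]".
TOOK_MILLIS = re.compile(r"took_millis\[(\d+)\]")


def slowest_entries(lines, top=5):
    """Return (took_millis, line) pairs for the slowest entries, slowest first."""
    timed = []
    for line in lines:
        match = TOOK_MILLIS.search(line)
        if match:  # lines without the field are skipped
            timed.append((int(match.group(1)), line.strip()))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:top]
```

Reading the source[...] portion of the top few lines is usually enough to tell whether one large query or a broad increase in traffic is behind the load.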

Checking if the cluster is overloaded

When the CPU Utilization and JVM Memory Pressure metrics in the Cluster Health dashboard are consistently high, the cluster is running up against its resource limits.
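
"Consistently high" can be made concrete by looking for a run of consecutive readings above a threshold rather than a single spike. The following is a small sketch of that check. The function name, the 80 percent threshold and the run length of three readings are illustrative assumptions, not values defined by this runbook.

```python
def is_sustained(values, threshold, min_consecutive):
    """True if at least `min_consecutive` readings in a row exceed `threshold`."""
    run = 0
    for value in values:
        # Extend the current run, or reset it when a reading drops back down.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU readings (percent), oldest first.
spike = [50, 92, 45, 48, 51]
sustained = [50, 85, 90, 88, 60]
print(is_sustained(spike, 80, 3), is_sustained(sustained, 80, 3))
```

A brief spike that fails this check is more likely a single large query or indexing burst, while a sustained run points towards an undersized cluster.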

Mitigation

Can the Skpr Platform Team block any undesired, malicious or malformed requests?
Can the Development Team disable a recently released feature generating these workloads?
Can the Development Team review field mappings to reduce unnecessary workloads?
Can the Development Team optimize the application's workload, for example through queueing or caching?
Can the Skpr Platform Team scale the OpenSearch cluster?
