Elasticsearch 5xx Alert Triage
Overview
This document provides:
- Hypotheses for what the problem might be.
- Steps to validate the hypothesis.
- Mitigations to be considered while debugging.
Hypotheses
- An instance has become unhealthy.
- A primary shard has not been allocated, or is unavailable.
- The cluster is experiencing issues with resource limitations.
- Malicious or malformed requests are being made.
- A large number of requests are being made, causing the instance to throttle.
Validation
Finding Cluster Health, Metrics and Logs
The sections below cover metrics that describe certain behaviours. To view these metrics, follow the directions below:
- Log into the AWS Console
- Navigate to OpenSearch Dashboard
- Select the OpenSearch cluster that is relevant to your alert
Cluster health
The section labelled "General Information" at the top of the page includes three components that indicate health (a scripted check is sketched after this list):
- Domain processing state
- Configuration change status
- Cluster Health
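If you want to capture the same "General Information" indicators from a script rather than the console, the domain status can also be read through the AWS API. Below is a minimal sketch using Python and boto3; the domain name is a placeholder, and your AWS credentials and region are assumed to be configured already.

```python
# Minimal sketch: read the domain status via the OpenSearch Service API.
# "my-search-domain" is a placeholder; replace it with the domain from your alert.
import boto3

opensearch = boto3.client("opensearch")

status = opensearch.describe_domain(DomainName="my-search-domain")["DomainStatus"]

# "Processing" is True while a configuration change is still being applied to
# the domain, which corresponds to the processing/configuration indicators
# shown under "General Information".
print("Configuration change in progress:", status.get("Processing"))
print("Engine version:", status.get("EngineVersion"))
print("Instance type:", status["ClusterConfig"].get("InstanceType"))
print("Instance count:", status["ClusterConfig"].get("InstanceCount"))
```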
Metrics
A suite of metrics is available as part of OpenSearch in the dashboard. Navigate to the "Cluster health" tab to find the metrics of interest, including:
- Cluster health: A histogram of cluster health showing whether the status was recently red, yellow or green. Yellow indicates that all primary shards are allocated but one or more replica shards are not. Red indicates that at least one primary shard is unallocated or unavailable, pointing to a misconfiguration or a failed node (a direct health-check sketch follows this list).
- Total nodes: A histogram of the node count, showing recent scaling events.
- HTTP requests by response code: A histogram of all requests made to the instance, broken down by status code. Incoming requests resulting in 2xx or 3xx responses indicate there is no network outage.
- CPU Utilization: A histogram of CPU utilization. Consistently sitting at 100% is an indicator of resource limitations.
- JVM Memory Pressure: A histogram of memory pressure. When memory pressure is high enough to exhaust the available memory of the instance, the application will experience 5xx errors.
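The same red/yellow/green signal, together with shard counts, can be pulled straight from the cluster's own health API, which is a quick way to confirm or rule out the unallocated-primary-shard hypothesis. A minimal sketch follows, assuming the domain endpoint allows basic authentication (domains with IAM-only access policies need SigV4-signed requests instead); the endpoint and credentials are placeholders.

```python
# Minimal sketch: query the cluster health API directly.
import requests

ENDPOINT = "https://search-my-domain.example.amazonaws.com"  # placeholder endpoint

resp = requests.get(
    f"{ENDPOINT}/_cluster/health",
    auth=("admin", "changeme"),  # placeholder basic-auth credentials
    timeout=10,
)
resp.raise_for_status()
health = resp.json()

# "status" is green/yellow/red; unassigned shards combined with a red status
# point at a primary shard that has not been allocated.
print("status:", health["status"])
print("active primary shards:", health["active_primary_shards"])
print("unassigned shards:", health["unassigned_shards"])
```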
Some other metrics can be found by viewing CloudWatch Metrics. To find them, navigate to the CloudWatch service, click "All metrics" on the dashboard navigation page, and search under the ES/OpenSearchService namespace.
- ThroughputThrottle: When this metric spikes past 1, requests are being throttled due to EBS volume limitations, which leads to the application returning 5xx errors (see the CloudWatch query sketch below).
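If you prefer to check these metrics without the console, they can be fetched with the CloudWatch API. The sketch below assumes the AWS/ES metric namespace and the DomainName/ClientId dimensions used by standard OpenSearch Service metrics; the domain name and account id are placeholders.

```python
# Minimal sketch: pull the last hour of a few relevant metrics from CloudWatch.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

DIMENSIONS = [
    {"Name": "DomainName", "Value": "my-search-domain"},  # placeholder domain
    {"Name": "ClientId", "Value": "123456789012"},        # placeholder account id
]

now = datetime.now(timezone.utc)

for metric, stat in [
    ("5xx", "Sum"),
    ("CPUUtilization", "Maximum"),
    ("JVMMemoryPressure", "Maximum"),
    ("ThroughputThrottle", "Maximum"),
]:
    result = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=DIMENSIONS,
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,  # 5 minute buckets
        Statistics=[stat],
    )
    points = sorted(result["Datapoints"], key=lambda p: p["Timestamp"])
    print(metric, [round(p[stat], 2) for p in points])
```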
Logs
The Logs tab in the bottom section of the OpenSearch cluster dashboard links to the following log groups in CloudWatch Logs Insights. In the case of a 5xx investigation, start by looking at the error logs; these can also help identify malformed or malicious requests (a query sketch follows the list below).
- Search slow logs
- Index slow logs
- Error logs
- Audit logs
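The error logs can also be queried from a script, which is handy for pulling the exact error messages that line up with the 5xx spike. A minimal sketch, assuming error logging is enabled for the domain; the log group name is a placeholder, so use the one linked from the Logs tab.

```python
# Minimal sketch: run a CloudWatch Logs Insights query against the error log group.
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")

LOG_GROUP = "/aws/opensearch/my-search-domain/error-logs"  # placeholder log group

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 50",
)["queryId"]

# Poll until the query finishes, then print the matching log lines.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```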
Mitigation
- Can the Development Team apply a fix to the application which would prevent the query from being run?
- Can the Skpr Platform Team block unwanted requests with the Web Application Firewall?
- Can the Skpr Platform Team scale the OpenSearch cluster to meet the resource requirements for the cluster?
- Can the Skpr Platform Team adjust the shard allocation rules in order to allocate any unassigned shards? (A diagnostic sketch follows this list.)
- Can the Skpr Platform Team terminate a bad compute instance?
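Before shard allocation rules are changed, it can help to ask the cluster why a shard is unassigned. The allocation explain API reports the reason and the per-node deciders that blocked allocation. A minimal sketch, with the same placeholder endpoint and credentials as above; note that calling it without a request body explains the first unassigned shard it finds, and returns an error if every shard is already assigned.

```python
# Minimal sketch: explain why a shard is unassigned before adjusting allocation rules.
import requests

ENDPOINT = "https://search-my-domain.example.amazonaws.com"  # placeholder endpoint

resp = requests.get(
    f"{ENDPOINT}/_cluster/allocation/explain",
    auth=("admin", "changeme"),  # placeholder basic-auth credentials
    timeout=10,
)
resp.raise_for_status()
explain = resp.json()

# "unassigned_info" carries the reason (e.g. NODE_LEFT, ALLOCATION_FAILED);
# the per-node decisions list which allocation deciders said no.
print(explain.get("index"), "shard", explain.get("shard"), "primary:", explain.get("primary"))
print("reason:", explain.get("unassigned_info", {}).get("reason"))
for node in explain.get("node_allocation_decisions", []):
    print(node.get("node_name"), node.get("node_decision"))
```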