Triaging Pod OOMKilled Errors
Overview
This document assists Skpr operations team members in triaging events where Pods have been killed due to memory issues (OOMKilled).
This document provides:
- Questions that teams should be asking while debugging
- Debugging steps that help answer those questions
- Mitigations to consider while debugging
Problem identification
To identify the affected resources, filter Pods by the OOMKilled status. The error can also be identified by the exit code recorded in the Pod's container status, which for this error is 137. OOMKilled indicates that memory is being constrained for the Pod. Both are visible in the examples below.
$ kubectl get all --all-namespaces | grep 'OOMKilled'
NAMESPACE     NAME          READY   STATUS      RESTARTS   AGE
mynamespace   pod/mypod-1   0/1     OOMKilled   0          8m53s
mynamespace   pod/mypod-2   0/1     OOMKilled   0          8h
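To confirm why a specific Pod was terminated, its last container state can be inspected directly. This is a minimal sketch using the illustrative names mypod-1 and mynamespace from the example above; the trimmed output is representative of what kubectl describe reports for an OOM-killed container.
# Inspect the Pod's container status; an OOM kill shows up as the
# termination reason, with exit code 137.
$ kubectl describe pod mypod-1 -n mynamespace
...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
...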
Hypothesis
- A Drush command being run does not have enough memory because a subprocess it spawned is occupying the resources.
- Requests and limits on the Pod have not been correctly optimised (see the command sketch after this list).
- The Pod was exceeding its memory thresholds before it was killed.
- The Node does not have enough system resources.
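The commands below are a minimal sketch for testing these hypotheses. They assume the illustrative Pod and namespace names used earlier, and that the container ships a PHP CLI (as a Drupal/Drush workload typically would); the exec check only works while the container is still running. Adjust the names to the affected workload.
# What memory requests/limits are configured on the Pod's containers?
$ kubectl get pod mypod-1 -n mynamespace -o jsonpath='{.spec.containers[*].resources}'

# How much memory has the Node already committed versus its allocatable total?
$ NODE=$(kubectl get pod mypod-1 -n mynamespace -o jsonpath='{.spec.nodeName}')
$ kubectl describe node "$NODE" | grep -A 8 'Allocated resources'

# If a Drush command is the suspect, what memory limit does the CLI PHP get?
$ kubectl exec mypod-1 -n mynamespace -- php -r 'echo ini_get("memory_limit"), PHP_EOL;'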
Debugging
Reviewing Logs and Metrics
How to access the Application Dashboard:
(Direct link from the Dashboard coming soon.)
- Log into the AWS Console
- Navigate to CloudWatch
- Browse the Dashboards Section
- Locate your application Dashboard (the name is in the format CLUSTER-PROJECT-ENVIRONMENT)
- Review the Command Line (Cron / Shell) section for possible leads.
- Review the application memory utilization for any spikes (cluster-side checks are sketched after this list).
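If CloudWatch is not immediately available, a rough picture of current memory usage and recent memory-related events can also be pulled from the cluster. This is a sketch using the illustrative namespace from earlier; kubectl top requires metrics-server, and OOM-related events only appear in some scenarios (for example node-level SystemOOM kills).
# Current memory usage per container (requires metrics-server).
$ kubectl top pod -n mynamespace --containers

# Recent events that may point at memory pressure or OOM kills.
$ kubectl get events -n mynamespace --sort-by=.lastTimestamp | grep -iE 'oom|memory'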
Mitigations
- Can the Development Team roll back the application to a version that did not have this issue? (See the sketch after this list.)
- Can the Skpr Platform Team adjust the memory limits associated with the application?
- Can the Skpr Platform Team adjust the memory limits associated with the node?
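For reference, the sketch below shows generic Kubernetes commands that correspond to the rollback and limit-adjustment options. The Deployment name mydeployment and the memory values are illustrative assumptions; on Skpr these changes would normally be made through the platform's own deployment and configuration tooling rather than applied directly with kubectl.
# Roll the workload back to its previous revision.
$ kubectl rollout undo deployment/mydeployment -n mynamespace

# Raise the memory request/limit for the workload's containers (illustrative values).
$ kubectl set resources deployment/mydeployment -n mynamespace \
    --requests=memory=256Mi --limits=memory=512Mi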