Triaging Pod OOMKilled Errors
Overview
This document assists Skpr operations team members in triaging events where Pods have been killed due to memory issues (OOMKilled).
This document provides:
- Questions that teams should be asking while debugging
- Debugging steps that help answer those questions
- Mitigations to consider while debugging
Problem identification
To identify the affected resources, filter Pods by the OOMKilled status. The error can also be identified by the exit code recorded in the Pod's container status, which for this error is 137. OOMKilled indicates that memory is being constrained for the Pod. Both are visible in the examples below.
$ kubectl get all --all-namespaces | grep 'OOMKilled'
NAMESPACE     NAME          READY   STATUS      RESTARTS   AGE
mynamespace   pod/mypod-1   0/1     OOMKilled   0          8m53s
mynamespace   pod/mypod-2   0/1     OOMKilled   0          8h
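To confirm why a specific Pod was terminated, its last container state can be inspected directly. This is a minimal sketch using the illustrative names mypod-1 and mynamespace from the example above; the trimmed output is representative of what kubectl describe reports for an OOM-killed container.
# Inspect the Pod's container status; an OOM kill shows up as the
# termination reason, with exit code 137.
$ kubectl describe pod mypod-1 -n mynamespace
...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
...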
Hypothesis
- A Drush command being run does not have enough memory because a subprocess it spawned is occupying the resources.
- Requests and limits on the Pod have not been correctly optimised (see the command sketch after this list).
- The Pod was exceeding its memory thresholds before it was killed.
- The Node does not have enough system resources.
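The commands below are a minimal sketch for testing these hypotheses. They assume the illustrative Pod and namespace names used earlier, and that the container ships a PHP CLI (as a Drupal/Drush workload typically would); the exec check only works while the container is still running. Adjust the names to the affected workload.
# What memory requests/limits are configured on the Pod's containers?
$ kubectl get pod mypod-1 -n mynamespace -o jsonpath='{.spec.containers[*].resources}'

# How much memory has the Node already committed versus its allocatable total?
$ NODE=$(kubectl get pod mypod-1 -n mynamespace -o jsonpath='{.spec.nodeName}')
$ kubectl describe node "$NODE" | grep -A 8 'Allocated resources'

# If a Drush command is the suspect, what memory limit does the CLI PHP get?
$ kubectl exec mypod-1 -n mynamespace -- php -r 'echo ini_get("memory_limit"), PHP_EOL;'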
Debugging
Reviewing Logs and Metrics
How to access the Application Dashboard:
(Direct link from the Dashboard coming soon.)
- Log into the AWS Console
- Navigate to CloudWatch
- Browse the Dashboards Section
- Locate your application Dashboard (the name is in the format CLUSTER-PROJECT-ENVIRONMENT)
- Review the Command Line (Cron / Shell) section for possible leads.
- Review the application memory utilization for any spikes (cluster-side checks are sketched after this list).
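If CloudWatch is not immediately available, a rough picture of current memory usage and recent memory-related events can also be pulled from the cluster. This is a sketch using the illustrative namespace from earlier; kubectl top requires metrics-server, and OOM-related events only appear in some scenarios (for example node-level SystemOOM kills).
# Current memory usage per container (requires metrics-server).
$ kubectl top pod -n mynamespace --containers

# Recent events that may point at memory pressure or OOM kills.
$ kubectl get events -n mynamespace --sort-by=.lastTimestamp | grep -iE 'oom|memory'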
Mitigations
- Can the Development Team roll back the application to a version that did not have this issue? (See the sketch after this list.)
- Can the Skpr Platform Team adjust the memory limits associated with the application?
- Can the Skpr Platform Team adjust the memory limits associated with the node?
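For reference, the sketch below shows generic Kubernetes commands that correspond to the rollback and limit-adjustment options. The Deployment name mydeployment and the memory values are illustrative assumptions; on Skpr these changes would normally be made through the platform's own deployment and configuration tooling rather than applied directly with kubectl.
# Roll the workload back to its previous revision.
$ kubectl rollout undo deployment/mydeployment -n mynamespace

# Raise the memory request/limit for the workload's containers (illustrative values).
$ kubectl set resources deployment/mydeployment -n mynamespace \
    --requests=memory=256Mi --limits=memory=512Mi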