Triaging Image Pull Errors
Problem identification
To identify the affected resources, filter by ErrImagePull or ImagePullBackOff. ImagePullBackOff is the state a pod enters after repeated failed pulls: the kubelet keeps retrying, but waits for an increasing back-off interval between attempts. You can do all of this using kubectl.
$ kubectl get all --all-namespaces | grep -E 'ErrImagePull|ImagePullBackOff'
NAMESPACE NAME READY STATUS RESTARTS AGE
mynamespace pod/mypod-1 0/1 ErrImagePull 0 8m53s
mynamespace pod/mypod-2 0/1 ImagePullBackOff 0 8h
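The same check can be done by matching on the STATUS column instead of anywhere in the line. The following is a minimal sketch, assuming the default column layout of kubectl get pods --all-namespaces (STATUS in the fourth column). The function name list_image_pull_failures is illustrative and not part of kubectl.

```shell
# Reads `kubectl get pods --all-namespaces` table output on stdin and prints
# "<namespace>/<pod> <status>" for every pod whose STATUS ends in
# ErrImagePull or ImagePullBackOff (this also catches Init:-prefixed statuses).
list_image_pull_failures() {
  awk '$4 ~ /(ErrImagePull|ImagePullBackOff)$/ { print $1 "/" $2, $4 }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods --all-namespaces --no-headers | list_image_pull_failures
```

Piping the result through wc -l gives a quick count of affected pods.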
Hypothesis
- The docker image may not be available for download, or it may not exist.
- There may be an issue with authentication, which is preventing a successful pull.
Debugging
Reviewing the Cluster Metrics Dashboard
Reviewing this dashboard gives an understanding of the scope of the impact. Is it a single metric data point or many? Are there trends or other factors to consider?
- Log into the AWS Console
- Navigate to CloudWatch
- Browse the Dashboards Section
- Locate the cluster metrics dashboard
- Locate the dashboard widget showing the ErrImagePull metrics
- Identify how many errors there are, and what trends are showing.
Are there any pod logs or events to help identify the issue?
$ kubectl -n mynamespace logs pod/mypod-1
Error from server (BadRequest): container "mycontainer" in pod "mypod-1" is waiting to start: trying and failing to pull image
$ kubectl -n mynamespace describe pod/mypod-1
State: Waiting
Reason: ImagePullBackOff
...
Warning Failed 3m57s (x4 over 5m28s) kubelet Error: ErrImagePull
Warning Failed 3m42s (x6 over 5m28s) kubelet Error: ImagePullBackOff
Normal BackOff 18s (x20 over 5m28s) kubelet Back-off pulling image "failed-image"
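When several containers or pods are affected, it can help to pull just the failing image names out of the event lines. The following is a minimal sketch that assumes the event message has the form shown above (Back-off pulling image "..."). The function name backoff_images is illustrative.

```shell
# Reads `kubectl describe pod` output on stdin and prints each distinct
# image named in a "Back-off pulling image" event, one per line.
backoff_images() {
  sed -n 's/.*Back-off pulling image "\([^"]*\)".*/\1/p' | sort -u
}

# Usage against a live cluster (not run here):
#   kubectl -n mynamespace describe pod/mypod-1 | backoff_images
```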
What is the docker image reference?
To identify the image references, look at the spec.
$ kubectl -n mynamespace get pod/mypod-1 -o json | jq -r '.spec.containers[].image'
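An image reference combines a registry, a repository, and a tag, and it is often useful to look at each part separately when checking for typos. The following is a simplified sketch of the common Docker naming convention, not a full parser: it ignores digests and assumes a missing registry means Docker Hub and a missing tag means latest. The function name parse_image_ref is illustrative.

```shell
# Prints "<registry> <repository> <tag>" for the image reference in $1.
# The body runs in a subshell so its variables do not leak into the caller.
parse_image_ref() (
  name="${1%%@*}"            # drop any @sha256:... digest suffix
  first="${name%%/*}"
  case "$name" in
    */*)
      # The first component is a registry only if it looks like a host.
      case "$first" in
        *.*|*:*|localhost) registry="$first"; rest="${name#*/}" ;;
        *) registry="docker.io"; rest="$name" ;;
      esac ;;
    *) registry="docker.io"; rest="$name" ;;
  esac
  # Docker Hub official images live under the "library/" namespace.
  if [ "$registry" = "docker.io" ]; then
    case "$rest" in */*) ;; *) rest="library/$rest" ;; esac
  fi
  # A tag is whatever follows ":" in the final path component.
  last="${rest##*/}"
  case "$last" in
    *:*) tag="${last##*:}"; repo="${rest%:*}" ;;
    *) tag="latest"; repo="$rest" ;;
  esac
  printf '%s %s %s\n' "$registry" "$repo" "$tag"
)
```

For example, parse_image_ref myorg/app:tag_1 prints docker.io myorg/app tag_1.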
Is the docker image available for download, or does it not exist?
Make sure the image exists. The image reference may simply be incorrect or invalid, or the image or the specific tag may not exist. For an image hosted on Docker Hub, list the available tags, where IMAGE_REF is the repository name without the tag:
$ curl -sL "https://registry.hub.docker.com/v2/repositories/${IMAGE_REF}/tags/" | jq -r '.results[].name' | sort
latest
tag_1
tag_2
tag_3
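Rather than scanning the tag list by eye, the tag from the pod spec can be checked against it directly. The following is a minimal sketch, with tag_exists as an illustrative helper name. It matches whole lines only, so tag_1 does not match tag_10.

```shell
# Succeeds (exit status 0) if the tag in $1 appears as a complete line
# in the tag list read from stdin; fails otherwise.
tag_exists() {
  grep -Fxq -- "$1"
}

# Usage with the tag listing shown above (not run here):
#   curl -sL "https://registry.hub.docker.com/v2/repositories/${IMAGE_REF}/tags/" \
#     | jq -r '.results[].name' | tag_exists tag_2 && echo "tag found"
```

The tags endpoint appears to return results in pages, so a tag that is missing from the first response may still exist on a later page.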
Is there an issue with authentication, which is preventing a successful pull?
Check the credentials and the API key. Describing the pod can also reveal an authorization failure in its events.
$ kubectl -n mynamespace describe pod/mypod-1
State: Waiting
Reason: ImagePullBackOff
...
Warning Failed 3m57s (x4 over 5m28s) kubelet Error: ErrImagePull
Warning Failed 3m42s (x6 over 5m28s) kubelet Error: ImagePullBackOff
Normal BackOff 18s (x20 over 5m28s) kubelet Back-off pulling image "authorization failed"
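If the pod pulls with an image pull secret, one way to check the credentials is to confirm which account the secret actually holds. The following is a sketch under assumptions: it assumes the pod uses a secret of type kubernetes.io/dockerconfigjson, and the secret name my-pull-secret and the helper name auth_username are hypothetical. Clusters that authenticate to ECR through node IAM roles may have no such secret at all.

```shell
# Prints the username encoded in a registry "auth" value, which is the
# base64 encoding of "username:password". The password is not printed.
auth_username() {
  printf '%s' "$1" | base64 --decode | cut -d: -f1
}

# Usage against a live cluster (not run here; my-pull-secret is hypothetical):
#   kubectl -n mynamespace get pod/mypod-1 -o jsonpath='{.spec.imagePullSecrets[*].name}'
#   kubectl -n mynamespace get secret my-pull-secret \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode | jq -r '.auths | keys[]'
```

The second command lists the registries the secret covers, which should include the registry part of the failing image reference.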
Mitigations
- Can the Development Team rebuild the project to make the docker images available?
- Is the Skpr Platform Team able to fix the authentication issue using a token, credentials, or IAM/ECR policy updates?