5xx Alert Triage
Overview
This document helps development and Skpr operations team members triage elevated 5xx response codes at the load balancer layer.
This document provides:
- Questions that teams should ask while debugging
- Steps to debug while working through those questions
- Mitigations to consider while debugging
Questions
- Is there an elevation in traffic levels?
- Is the traffic being generated by a single IP or a small list of IPs?
- Is the traffic all targeting a single page or a specific list of pages?
- Is there a dip in CDN Cache/HIT ratios?
- Is there an elevation in 404 (Page Not Found) levels?
- Is there an elevation in 403 (Access Denied) levels?
- Was a new feature rolled out recently?
- Is there resource contention, e.g. hitting scaling limits at the application or services level?
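The questions about IPs and pages can be answered quickly once access log entries are parsed into records. The following is a minimal sketch, assuming each record is a dictionary with hypothetical `ip` and `path` keys; the sample data is illustrative only and does not reflect any real log format used by the platform.

```python
from collections import Counter


def top_share(records, field, n=3):
    """Return the n most common values of `field` and their share of all records."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(value, count / total) for value, count in counts.most_common(n)]


# Hypothetical sample of parsed access log records (documentation IP ranges).
records = [
    {"ip": "203.0.113.7", "path": "/search", "status": 502},
    {"ip": "203.0.113.7", "path": "/search", "status": 502},
    {"ip": "203.0.113.7", "path": "/search", "status": 200},
    {"ip": "198.51.100.4", "path": "/", "status": 200},
]

print(top_share(records, "ip"))
print(top_share(records, "path"))
```

A single value holding most of the share suggests the traffic is concentrated on one client or one page, which points towards blocking or caching rather than scaling.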
Steps to Debug
Review Application Dashboard Widgets
How to access the Application Dashboard:
Direct link from Dashboard: coming soon.
- Log into the AWS Console
- Navigate to CloudWatch
- Browse the Dashboards Section
- Locate your application Dashboard
- The name is in the format:
CLUSTER-PROJECT-ENVIRONMENT
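As an illustration of the naming format, the sketch below joins the three parts with hyphens. The function name and the example values are hypothetical; substitute the cluster, project and environment names for your own application.

```python
def dashboard_name(cluster, project, environment):
    """Build a dashboard name in the CLUSTER-PROJECT-ENVIRONMENT format."""
    return "-".join([cluster, project, environment])


# Hypothetical values, for illustration only.
print(dashboard_name("cluster1", "myproject", "prod"))
```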
Dashboard Widgets of Interest:
- ALB: Response Codes - The metric which triggered this alert.
- CloudFront: Requests - Used to determine if there is an elevation in requests.
- CloudFront: Cache Hit Ratios - Used to determine if there is a dip in caching.
- CloudFront: 4xx Errors - Used to determine if clients are requesting content that does not exist.
- Number of Instances - Used to determine if a scaling event is occurring.
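When reading these widgets, it helps to compare the alert window against a quiet baseline window rather than judging raw numbers. The sketch below shows one way to do that, assuming the counts have been read off the dashboard by hand; the dictionary keys, the sample figures and the `factor` threshold are all assumptions, not values defined by the platform.

```python
def rate(part, total):
    """Return part / total, or 0.0 when there were no requests."""
    return part / total if total else 0.0


def is_elevated(current, baseline, factor=2.0):
    """True when `current` exceeds `baseline` by more than `factor` times."""
    return current > baseline * factor


# Hypothetical counts for a baseline window and the alert window.
baseline = {"requests": 10_000, "errors_5xx": 20, "cache_hits": 9_000}
current = {"requests": 30_000, "errors_5xx": 900, "cache_hits": 15_000}

traffic_up = is_elevated(current["requests"], baseline["requests"])
errors_up = is_elevated(
    rate(current["errors_5xx"], current["requests"]),
    rate(baseline["errors_5xx"], baseline["requests"]),
)
# A dip is the inverse check: the baseline ratio is well above the current one.
cache_dip = rate(current["cache_hits"], current["requests"]) < 0.8 * rate(
    baseline["cache_hits"], baseline["requests"]
)

print(traffic_up, errors_up, cache_dip)
```

Elevated traffic together with a cache dip suggests requests are bypassing the CDN, while elevated errors without a traffic change points more towards a release or resource contention.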
Review Application Logs
Go to the Application Dashboard.
Review the following sections for trends:
- Alert, Critical and Emergency
- Error and Warning
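If log entries are exported for offline review, they can be grouped into the same two sections to spot trends. This is a minimal sketch, assuming each entry carries a lowercase severity level string; the input list is hypothetical and the grouping simply mirrors the section names above.

```python
SECTIONS = {
    "Alert, Critical and Emergency": {"alert", "critical", "emergency"},
    "Error and Warning": {"error", "warning"},
}


def count_by_section(levels):
    """Count severity levels per dashboard section, ignoring other levels."""
    counts = {name: 0 for name in SECTIONS}
    for level in levels:
        for name, members in SECTIONS.items():
            if level.lower() in members:
                counts[name] += 1
    return counts


# Hypothetical severity levels taken from exported log entries.
levels = ["error", "warning", "critical", "info", "error"]
print(count_by_section(levels))
```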
Review Slow Query Logs
Slow Query Logs: coming soon.
Mitigations
- Can the Development Team disable a feature?
- Can the Development Team roll back a release?
- Can the Skpr Platform Team block an IP or User Agent?
- Can the Skpr Platform Team temporarily upsize the infrastructure?