5xx Alert Triage

Overview

This document helps development and Skpr operations team members triage elevated 5xx response codes at the load balancer layer.

This document provides:

  • Questions that teams should be asking while debugging
  • Steps to debug while asking the questions
  • Mitigations to be considered while debugging

Questions

  • Is there an elevation in traffic levels?
    • Is the traffic being generated by a single IP or a small set of IPs?
    • Is the traffic all targeting a single page or a specific list of pages?
  • Is there a dip in CDN cache hit ratios?
  • Is there an elevation in 404 (Page Not Found) levels?
  • Is there an elevation in 403 (Access Denied) levels?
  • Was a new feature rolled out recently?
  • Is there resource contention, e.g. hitting scaling limits at the application or services level?
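
The traffic questions above can often be answered by counting requests per client IP and per path over the alert window. The following is a minimal sketch, assuming the access log entries have already been parsed into dictionaries; the field names client_ip, path and status are hypothetical and will differ depending on the log source.

```python
from collections import Counter


def top_sources(records, field, top_n=3):
    """Return (value, count, share) tuples for the most common values of `field`.

    `records` is a list of dicts; the field names are assumptions made for
    this sketch, not a documented Skpr log format.
    """
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return [(value, count, count / total) for value, count in counts.most_common(top_n)]


# Illustrative sample data only (documentation IP ranges, made-up paths).
sample = [
    {"client_ip": "203.0.113.7", "path": "/search", "status": 502},
    {"client_ip": "203.0.113.7", "path": "/search", "status": 502},
    {"client_ip": "198.51.100.4", "path": "/", "status": 200},
    {"client_ip": "203.0.113.7", "path": "/node/1", "status": 504},
]

print(top_sources(sample, "client_ip"))
print(top_sources(sample, "path"))
```

A single IP or path holding a large share of requests suggests targeted traffic; an even spread points more towards a general traffic increase or an application-level fault.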

Steps to Debug

Review Application Dashboard Widgets

How to access the Application Dashboard:

Direct link from Dashboard: coming soon.

  • Log into the AWS Console
  • Navigate to CloudWatch
  • Browse the Dashboards Section
  • Locate your application Dashboard
    • The name is in the format: CLUSTER-PROJECT-ENVIRONMENT
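
As a small illustration of the naming format above, the dashboard name can be assembled from its three parts. This is only a sketch: the example values are hypothetical, and whether the real names are upper or lower case is not stated here.

```python
def dashboard_name(cluster, project, environment):
    """Join the three parts into the CLUSTER-PROJECT-ENVIRONMENT format."""
    parts = [cluster, project, environment]
    if not all(parts):
        raise ValueError("cluster, project and environment are all required")
    return "-".join(parts)


# Hypothetical values, for illustration only.
print(dashboard_name("mycluster", "myproject", "prod"))
```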

Dashboard Widgets of Interest:

  • ALB: Response Codes - The metric which triggered this alert.
  • CloudFront: Requests - Used to determine if there is an elevation in requests.
  • CloudFront: Cache Hit Ratios - Used to determine if there is a dip in caching.
  • CloudFront: 4xx Errors - Used to determine whether clients are requesting content that does not exist.
  • Number of Instances - Used to determine if a scaling event is occurring.
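
When comparing widget readings, it can help to express the alert window as a relative change against a normal period. The sketch below assumes the numbers are read off the dashboard by hand; the baseline and current values shown are made up for illustration, and no particular threshold is implied.

```python
def cache_hit_ratio(hits, total):
    """Fraction of requests served from cache; 0.0 when there were no requests."""
    return hits / total if total else 0.0


def relative_change(baseline, current):
    """Relative change from baseline to current, or None if baseline is zero."""
    if baseline == 0:
        return None
    return (current - baseline) / baseline


# Illustrative readings only: requests per minute and cache hit ratio.
requests_change = relative_change(1000, 2500)
hit_ratio_change = relative_change(cache_hit_ratio(900, 1000), cache_hit_ratio(450, 1000))

print(f"Requests changed by {requests_change:+.0%}")
print(f"Cache hit ratio changed by {hit_ratio_change:+.0%}")
```

A rise in requests together with a drop in cache hit ratio means more traffic is reaching the application, which is worth checking before looking at the application itself.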

Review Application Logs

Go to the Application Dashboard.

Review the following sections for trends:

  • Alert, Critical and Emergency
  • Error and Warning
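
If log entries are exported for offline review, they can be counted under the same two groupings to see which one is trending. This is a minimal sketch, assuming each entry is a dictionary with a severity field; that field name and the sample entries are hypothetical.

```python
from collections import Counter

# Map each severity level to the dashboard section it appears under.
SEVERITY_GROUPS = {
    "alert": "Alert, Critical and Emergency",
    "critical": "Alert, Critical and Emergency",
    "emergency": "Alert, Critical and Emergency",
    "error": "Error and Warning",
    "warning": "Error and Warning",
}


def count_by_group(entries):
    """Count log entries per section, ignoring severities outside both groups."""
    counts = Counter()
    for entry in entries:
        group = SEVERITY_GROUPS.get(entry["severity"].lower())
        if group:
            counts[group] += 1
    return dict(counts)


# Illustrative entries only.
entries = [
    {"severity": "Error"},
    {"severity": "critical"},
    {"severity": "warning"},
    {"severity": "info"},
    {"severity": "EMERGENCY"},
]

print(count_by_group(entries))
```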

Review Slow Query Logs

Slow Query Logs: coming soon.

Mitigations

  • Can the Development Team disable a feature?
  • Can the Development Team roll back a release?
  • Can the Skpr Platform Team block an IP or User Agent?
  • Can the Skpr Platform Team temporarily upsize the infrastructure?