Database CPU Utilization Triage
Overview
This document provides:
- Hypothesis for what the problem might be.
- Steps to validate the hypothesis.
- Mitigations to be considered while debugging.
Hypothesis
- The database is executing a large query.
- The database being flooded with requests from the application.
Validation
Review RDS Dashboards
Review both prod and non-production RDS dashboards to determine:
- Is there a surge in connections?
- Are there any slow queries?
These CloudWatch dashboards are prefixed with:
skpr-CLUSTER-aurora-*
Review Slow Queries on the RDS Cluster
Queries will only show up in the RDS Dashboard if they have completed.
Perform the following steps to see active queries that are still running:
- Get a Skpr shell session on the appropriate environment. (
skpr shell <ENVIRONMENT>
) - Connect to the RDS cluster (
drush sqlc
if the application is based on Drupal)
Run the following commands:
show processlist;
show full processlist;
This will allow you to determine what slow queries are causing the load AND potentially provide vital information to developers as they help narrow down root cause.
Review WAF Dashboards
Review both prod and non-production WAF dashboards to determine if there is a surge in web requests.
Spikes in the WAF dashboards should correlate with spikes in the RDS dashboards.
These CloudWatch dashboards are prefixed with:
skpr-CLUSTER-waf-*
Review RDS Insights
Review the RDS insights dashboard and look for:
- Current compute workloads
- Unexpected workloads
This will show you the full scope and/or SQL commands being executed across the cluster for granular detail to help with your investigation.
Mitigation
- Can the Skpr Platform Team block specific CIDR blocks or application paths using a WAF?
- Can the Skpr Platform Team scale the aurora instance to provide more resources?
- Can the Skpr Platform Team manually kill specific slow queries using
kill <PID>;
?