Database Storage Triage

Overview

This document has been written to assist both development and Skpr operations team members in triaging storage fluctuations in the database.

This document provides:

Hypothesis for what the problem might be.
Steps to validate the hypothesis.
Mitigations to be considered while debugging.

Hypothesis

A large spike in content has been added to the database.
A single project/feature has caused a spike in database storage.

Validation

Review Application Dashboard Widgets

The database features several interesting metrics if you have access to the dashboard. For now, this feature is available to dedicated clusters and non-inclusive of shared infrastructure. The name of this dashboard is CLUSTER-aurora-INSTANCE.

How to access the database Dashboard:

Direct link from Dashboard Coming Soon

Log into the AWS Console
Navigate to CloudWatch
Browse the Dashboards Section
Locate your database Dashboard

Dashboard Widgets of Interest

Database Size: Useful for finding the total database size over time

Taking a look over the past month, can you determine if the change is gradual or suspiciously high? You may need to reach out to Skpr Operations staff if the line is gradually increasing over time.

Insert Throughput: Useful to find out how much data is going into the database

Blobs, or non-binary data will show much higher spikes in activity. Can any spikes be identified and are there any indications this is related to a business activity in the application?

Delete Throughput: Useful to find out how much data is coming out of the database

This graph will indicate that data which has been loaded into the database is successfully being purged, so ideally you will see spikes and patterns correlating to the insert throughput. A lack of data here will indicate that information is not being removed or processed out of the database, driving storage requirements.

Which databases and tables are taking up the most storage?

By identifying the database that is the largest on the cluster and finding the largest tables, a plan can be approached based on what is normal for the application. The approach may be to clean or purge a table, or application development may be required to remediate a bug or workflow.

Finding which database is taking up the most storage

In order to identify the larger databases, you will need to connect to the RDS cluster and run the following query. This will list all the databases and their usage size on disk.

SELECT table_schema AS "Database", 
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS "Size (MB)" 
FROM information_schema.TABLES 
GROUP BY table_schema
ORDER BY SUM(data_length + index_length) DESC;

Finding which tables in a database are taking up the most storage

After you've determined the size of the databases, you can work out which tables are larger than expectation. You can run this query while you are connected to RDS or from your Skpr project.

Get a Skpr shell session on the appropriate environment. (skpr shell <ENVIRONMENT>)
Connect to the RDS cluster (drush sql:cli if the application is based on Drupal)

Run the following query:

SELECT TABLE_NAME AS `Table`,
ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
FROM information_schema.TABLES
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;

Mitigations

Can the Skpr Platform Team modify the configuration of the alert if the growth of the database is organic?
Can the Development Team fix an issue in the application causing the storage increases?

Overview​

Hypothesis​

Validation​

Review Application Dashboard Widgets​

Dashboard Widgets of Interest​

Which databases and tables are taking up the most storage?​

Finding which database is taking up the most storage​

Finding which tables in a database are taking up the most storage​

Mitigations​