Disaster Recovery Plan

Overview

This document outlines the steps that can be taken at each level of the Skpr hosting platform in the event of a disaster.

Highly Available Architecture

Traditionally, disaster recovery involved setting up duplicate infrastructure in separate data centres, along with the associated network infrastructure.

Skpr uses a highly available architecture that is resilient to failures in the first place, which avoids the need for expensive duplicate standby infrastructure.

It is important to first understand the solution before discussing disaster recovery.

For more information see our public architecture documentation.

Tooling

These are the tools used to manage the Skpr platform and the applications it hosts.

Terraform - Manages the lifecycle of the Skpr hosting platform.
Kubernetes Controllers - Automate the provisioning of application-specific resources, e.g. CDN, load balancers, databases and file storage.

Backup Storage / Retention

The Skpr platform backups are categorised as either system or workflow.

System

These backups are only available to Skpr platform operators and follow the retention strategy below.

Type	Retention	Schedule
Daily	7 days	Nightly
Weekly	5 weeks	First day of the week
Monthly	7 months	First day of the month
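
As a rough illustration of how the schedule above reads, the following shell sketch maps each backup type to a maximum age in days and checks whether a backup of a given age would still be retained. The function names are hypothetical, and the monthly figure assumes 30-day months; actual expiry is enforced by the backup service, not by a script like this.

```shell
# Hypothetical helpers that encode the retention table above.
# retention_days TYPE -> prints the maximum backup age in days.
retention_days() {
  case "$1" in
    daily)   echo 7 ;;
    weekly)  echo 35 ;;    # 5 weeks
    monthly) echo 210 ;;   # 7 months, assuming 30-day months
    *)       echo "unknown backup type: $1" >&2; return 1 ;;
  esac
}

# is_retained TYPE AGE_DAYS -> succeeds while the backup is within retention.
is_retained() {
  max_days="$(retention_days "$1")" || return 1
  [ "$2" -le "$max_days" ]
}
```

For example, `is_retained weekly 20` succeeds, while `is_retained daily 8` fails because the daily window is 7 days.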

Backups are stored in AWS Backup, a fully managed service for the backup and restoration of AWS managed services.

AWS Backup targets the following managed services:

Amazon Relational Database Service (RDS)
Amazon EFS
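
As a sketch of how a platform operator might inspect these system backups, the AWS CLI can list the recovery points held in a backup vault, filtered by resource type. The vault name is a placeholder and the wrapper functions are hypothetical; this document does not state how Skpr names its vaults or whether operators use the CLI for this.

```shell
# run prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# list_recovery_points VAULT RESOURCE_TYPE
# VAULT is a placeholder vault name; RESOURCE_TYPE is e.g. RDS or EFS.
list_recovery_points() {
  run aws backup list-recovery-points-by-backup-vault \
    --backup-vault-name "$1" \
    --by-resource-type "$2"
}
```

Running `DRY_RUN=1; list_recovery_points example-vault RDS` prints the command without contacting AWS, which is useful for reviewing it first.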

Workflow

Creating a backup

skpr backup create prod

Restoring from a backup

skpr restore create stg BACKUP_ID

Scenarios

Application Level Recovery

Scenario

Data has been lost due to a code or user error in the application.

Solution

Restore the content of the site using the Skpr command-line interface.

Recovery Time Objective (RTO)

30 minutes

Time will vary depending on database and file size.

Recovery Point Objective (RPO)

24 hours

Steps

The development team recovers their data from the most recent backup using the skpr restore command.
The Skpr platform team assists in the restoration if required.
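
The restore step can be sketched as a small wrapper around the restore command shown in the Workflow section. The wrapper and its dry-run switch are hypothetical. It assumes the environment name is the first positional argument, as in the `stg` example, and it leaves choosing the most recent BACKUP_ID to the operator because this document does not show how backup IDs are listed.

```shell
# run prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# restore_env ENVIRONMENT BACKUP_ID
# Hypothetical wrapper around the documented restore command.
restore_env() {
  if [ "$#" -ne 2 ]; then
    echo "usage: restore_env ENVIRONMENT BACKUP_ID" >&2
    return 2
  fi
  run skpr restore create "$1" "$2"
}
```

Setting `DRY_RUN=1` before calling `restore_env stg BACKUP_ID` prints the command for review instead of running it.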

Platform Level Recovery

Scenario

Configuration which relates to the operation of the Skpr hosting platform has been deleted.

Solution

Reapply platform configuration.

Recovery Time Objective (RTO)

30 minutes

Recovery Point Objective (RPO)

24 hours

Steps

The Skpr platform team will reapply the configuration using infrastructure as code manifests.
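
Assuming the manifests in question are the Terraform configuration mentioned under Tooling, reapplying them might look like the sketch below. The directory argument and the function names are hypothetical, and the real procedure may involve additional steps (or Kubernetes manifests) that this document does not describe.

```shell
# run prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# reapply_platform DIR
# DIR is a hypothetical path to the platform's Terraform configuration.
reapply_platform() {
  if [ "$#" -ne 1 ]; then
    echo "usage: reapply_platform DIR" >&2
    return 2
  fi
  # Initialise, save a plan for review, then apply exactly that plan.
  run terraform -chdir="$1" init -input=false &&
  run terraform -chdir="$1" plan -input=false -out=tfplan &&
  run terraform -chdir="$1" apply -input=false tfplan
}
```

Saving the plan to a file and applying that file means the changes that were reviewed are the changes that get applied.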

Infrastructure Provider Level Recovery

Scenario

The underlying infrastructure provider (AWS) has had a catastrophic failure which results in downtime for the Skpr hosting platform. This will typically occur if the provider suffers a failure across all 3 availability zones.

Solution

Migrate the Skpr platform to a new region that is not experiencing failures.

Recovery Time Objective (RTO)

3 hours

Duration is dependent on multiple factors e.g. the ability to update DNS records.

Recovery Point Objective (RPO)

24 hours

Steps

The Skpr platform team determines the best region for the new platform. Available options include:
- Singapore - This is the closest available data centre outside of Australia.
- Melbourne - Coming Soon
The Skpr platform team will provision a new Skpr platform on the existing AWS account within a new region.
The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
- The Skpr platform team will move application data (database, files etc.) to the new platform.
- Clients will review the recovered sites for defects prior to updating DNS.
- Clients will update their DNS records to direct traffic to the new site.

Considerations

Data Sovereignty - The Skpr platform team will liaise with clients to determine whether they have organisational data sovereignty rules that prohibit the use of overseas data centres.
DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use to implement a defence-in-depth strategy for external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.
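
For the DNS consideration above, a client could confirm that a record now points at the new platform with a check along these lines. This is a minimal sketch that assumes `dig` is installed; the hostnames and target values are placeholders, and the function names are hypothetical.

```shell
# resolve HOSTNAME -> prints one DNS answer per line.
# Defined as a function so it can be swapped for another resolver.
resolve() {
  dig +short "$1"
}

# dns_points_to HOSTNAME EXPECTED
# Succeeds when EXPECTED appears verbatim in the answers for HOSTNAME.
dns_points_to() {
  resolve "$1" | grep -Fxq -- "$2"
}
```

Note that `dig +short` prints CNAME targets with a trailing dot, so EXPECTED needs to include it when matching a CNAME, e.g. `dns_points_to www.example.com new-platform.example.net.`. Cached records may keep returning the old value until their TTL expires.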

Alternative Solution

Static Snapshot - Generate a static version of the site, which requires minimal infrastructure to operate. The Skpr platform team works with the development team to generate the site and update the existing CDN to direct traffic to the temporary solution.

Security Level Recovery

Scenario

The Skpr platform team determines the security of the platform has been compromised.

The following plan does not cover tasks that will be completed as part of a security-related investigation.

Solution

Migrate the Skpr platform to a new AWS account.

Recovery Time Objective (RTO)

3 hours

Duration is dependent on multiple factors e.g. the ability to update DNS records.

Recovery Point Objective (RPO)

24 hours

Steps

The Skpr platform team will provision a new Skpr platform on a new AWS account.
The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
- The Skpr platform team will move application data (database, files etc.) to the new platform.
- Clients will review the recovered sites for defects prior to updating DNS.
- Clients will update their DNS records to direct traffic to the new site.

Considerations

DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use to implement a defence-in-depth strategy for external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.
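
To work out which allowlist entries change after a migration, the old and new static IP sets can be compared. The sketch below assumes each set is kept in a non-empty text file with one IP per line; the file layout and function names are hypothetical, and the example addresses are from documentation-reserved ranges.

```shell
# ips_to_add OLD_FILE NEW_FILE
# Prints IPs present in NEW_FILE but not in OLD_FILE, one per line.
ips_to_add() {
  grep -Fxv -f "$1" "$2"
}

# ips_to_remove OLD_FILE NEW_FILE
# Prints IPs present in OLD_FILE but not in NEW_FILE, one per line.
ips_to_remove() {
  grep -Fxv -f "$2" "$1"
}
```

Whole-line, fixed-string matching (`-Fx`) keeps an address such as 192.0.2.10 from being treated as a match for 192.0.2.101.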
