Disaster Recovery

Overview

This document outlines the steps that can be taken at each level of the Skpr hosting platform in the event of a disaster.

Highly Available Architecture

Traditionally, disaster recovery involved setting up duplicate infrastructure in separate data centres, along with the associated network infrastructure.

Skpr leverages high availability architecture, which is resilient to failures in the first place, avoiding the need for expensive duplicate standby infrastructure.

It is important to first understand the solution before discussing disaster recovery.

For more information see our public architecture documentation.

Tooling

These are the tools used to manage the Skpr platform and the applications it hosts.

  • Terraform - Managing the lifecycle of the Skpr hosting platform.
  • Kubernetes Controllers - Automating the provisioning of application-specific resources, e.g. CDN, load balancer, databases and file storage.

Backup Storage / Retention

The Skpr platform backups are categorised as either system or workflow.

System

These backups are only available to Skpr platform operators and follow the retention strategy below.

Type      Retention   Schedule
Daily     7 days      Nightly
Weekly    5 weeks     First day of the week
Monthly   7 months    First day of the month
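
To make the retention table concrete, the following is a minimal sketch of how the schedule and retention periods combine for a given date. It is illustrative only and does not describe how AWS Backup or Skpr implement the policy. The function names are hypothetical, and treating Monday as the "first day of the week" is an assumption.

```python
from datetime import date, timedelta


def add_months(day: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    # First day of the month after the target month, used to find its length.
    first_of_next = date(year + month // 12, month % 12 + 1, 1)
    last_day = (first_of_next - timedelta(days=1)).day
    return date(year, month, min(day.day, last_day))


def backups_for(day: date) -> dict[str, date]:
    """Map each backup type taken on `day` to the date its retention ends."""
    taken = {"Daily": day + timedelta(days=7)}
    if day.weekday() == 0:  # Assumption: the week starts on Monday.
        taken["Weekly"] = day + timedelta(weeks=5)
    if day.day == 1:
        taken["Monthly"] = add_months(day, 7)
    return taken
```

For example, `backups_for(date(2024, 1, 1))` (a Monday and the first of the month) reports all three types, while `backups_for(date(2024, 1, 2))` reports only the daily backup.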

Backups are stored in AWS Backup, a fully managed service for backing up and restoring AWS managed services.

AWS Backup targets the following managed services:

  • Amazon Relational Database Service (RDS)
  • Amazon Elastic File System (EFS)

Workflow

Creating a backup

skpr backup create prod

Restoring from a backup

skpr restore create stg BACKUP_ID
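
Teams that script these workflows might wrap the two commands above. The sketch below is one hypothetical way to do so: the helper names are invented for illustration, and only the two command forms shown above are taken from this document. No other `skpr` flags or output formats are assumed.

```python
import subprocess


def backup_command(environment: str) -> list[str]:
    """Build the argument list for `skpr backup create <environment>`."""
    if not environment:
        raise ValueError("environment must not be empty")
    return ["skpr", "backup", "create", environment]


def restore_command(environment: str, backup_id: str) -> list[str]:
    """Build the argument list for `skpr restore create <environment> <backup_id>`."""
    if not environment or not backup_id:
        raise ValueError("environment and backup_id must not be empty")
    return ["skpr", "restore", "create", environment, backup_id]


def run(command: list[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution means the argument lists can be reviewed or logged before anything is run against an environment.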

Scenarios

Application Level Recovery

Scenario

Data has been lost due to a code or user error in the application.

Solution

Restore the content of the site using the Skpr command-line interface.

Recovery Time Objective (RTO)

30 minutes

Time will vary depending on the size of the database and files.

Recovery Point Objective (RPO)

24 hours

Steps

  • The development team recovers their data from the most recent backup using the skpr restore command.
  • Skpr platform team assists in the restoration if required.
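
The relationship between backup times and the 24 hour RPO can be sketched as follows. This is an illustrative calculation only. The function names are hypothetical, and the backup timestamps would need to come from whatever backup listing is available to the team.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=24)  # Recovery Point Objective stated above.


def data_loss_window(backup_times, incident_time):
    """Return the gap between the incident and the latest backup before it.

    Returns None when no backup predates the incident.
    """
    usable = [t for t in backup_times if t <= incident_time]
    if not usable:
        return None
    return incident_time - max(usable)


def meets_rpo(backup_times, incident_time, rpo=RPO):
    """True when the most recent usable backup falls within the RPO."""
    window = data_loss_window(backup_times, incident_time)
    return window is not None and window <= rpo
```

With nightly backups, an incident at 15:00 would give a data loss window of at most the time elapsed since the previous night's backup, which is within the 24 hour objective.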

Platform Level Recovery

Scenario

Configuration which relates to the operation of the Skpr hosting platform has been deleted.

Solution

Reapply platform configuration.

Recovery Time Objective (RTO)

30 minutes

Recovery Point Objective (RPO)

24 hours

Steps

The Skpr platform team will reapply the configuration using infrastructure as code manifests.
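
As an illustration of how missing configuration might be detected before it is reapplied, the sketch below interprets the exit code of `terraform plan -detailed-exitcode`, which is 0 when there are no changes, 1 on error, and 2 when changes are pending. That Terraform is the tool used for this particular step, and the `workdir` parameter, are assumptions. The function names are hypothetical.

```python
import subprocess


def plan_status(exit_code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code into a status."""
    if exit_code == 0:
        return "in-sync"  # Live configuration matches the manifests.
    if exit_code == 2:
        return "drift"  # Changes are pending, so configuration must be reapplied.
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) and report the status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return plan_status(result.returncode)
```

A "drift" result would indicate that the manifests describe resources that no longer exist or differ, which is the situation this scenario covers.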


Infrastructure Provider Level Recovery

Scenario

The underlying infrastructure provider (AWS) has had a catastrophic failure which results in downtime for the Skpr hosting platform. This will typically occur if the provider suffers a failure across all 3 availability zones.

Solution

Migrate the Skpr platform to a new region that is not experiencing failures.

Recovery Time Objective (RTO)

3 hours

Duration is dependent on multiple factors e.g. the ability to update DNS records.

Recovery Point Objective (RPO)

24 hours

Steps

  • The Skpr platform team determines the best region for the new platform. Available options include:
    • Singapore - This is the closest available data centre outside of Australia.
    • Melbourne - Coming Soon
  • The Skpr platform team will provision a new Skpr platform on the existing AWS account within a new region.
  • The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
    • The Skpr platform team will move application data (database, files etc.) to the new platform.
    • Clients will review the recovered sites for defects prior to updating DNS.
    • Clients will update their DNS records to direct traffic to the new site.

Considerations

  • Data Sovereignty - The Skpr platform team will liaise with clients to determine whether they have organisational data sovereignty rules that prohibit the use of overseas data centres.
  • DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
  • Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
  • Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use as part of a defence in depth strategy when configuring external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.
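
The static IP consideration above amounts to swapping one set of addresses for another in each external service's allowlist. A minimal sketch is shown below. The function name is hypothetical and the addresses in the usage note are reserved documentation ranges, not real Skpr IPs.

```python
from ipaddress import ip_address


def update_allowlist(allowlist, old_ips, new_ips):
    """Return the allowlist with old platform IPs removed and new ones added.

    Entries unrelated to the platform are kept. The result is sorted by
    IP version and then numerically.
    """
    old = {ip_address(ip) for ip in old_ips}
    kept = {ip_address(ip) for ip in allowlist} - old
    updated = kept | {ip_address(ip) for ip in new_ips}
    ordered = sorted(updated, key=lambda ip: (ip.version, int(ip)))
    return [str(ip) for ip in ordered]
```

For example, `update_allowlist(["203.0.113.5", "192.0.2.10"], ["192.0.2.10"], ["198.51.100.20"])` returns `["198.51.100.20", "203.0.113.5"]`, leaving the unrelated `203.0.113.5` entry in place.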

Alternative Solution

  • Static Snapshot - Generate a static version of the site, which requires minimal infrastructure to operate. The Skpr platform team works with the development team to generate the site and update the existing CDN to direct traffic to the temporary solution.

Security Level Recovery

Scenario

The Skpr platform team determines the security of the platform has been compromised.

The following plan does not cover tasks that will be completed as part of a security-related investigation.

Solution

Migrate the Skpr platform to a new AWS account.

Recovery Time Objective (RTO)

3 hours

Duration is dependent on multiple factors e.g. the ability to update DNS records.

Recovery Point Objective (RPO)

24 hours

Steps

  • The Skpr platform team will provision a new Skpr platform on a new AWS account.
  • The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
    • The Skpr platform team will move application data (database, files etc.) to the new platform.
    • Clients will review the recovered sites for defects prior to updating DNS.
    • Clients will update their DNS records to direct traffic to the new site.

Considerations

  • DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
  • Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
  • Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use as part of a defence in depth strategy when configuring external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.