Disaster Recovery
Overview
This document outlines the steps that can be taken at each level of the Skpr hosting platform in the event of a disaster.
Highly Available Architecture
Traditionally, disaster recovery involved setting up duplicate infrastructure in separate data centres, along with the associated network infrastructure.
Skpr uses a highly available architecture that is resilient to failures in the first place, which avoids the need for expensive duplicate standby infrastructure.
It is important to first understand the solution before discussing disaster recovery.
For more information see our public architecture documentation.
Tooling
These are the tools used to manage the Skpr platform and the applications it hosts.
- Terraform - Manages the lifecycle of the Skpr hosting platform.
- Kubernetes Controllers - Automate the provisioning of application-specific resources, e.g. CDN, load balancer, databases and file storage.
Backup Storage / Retention
Skpr platform backups are categorised as either system or workflow.
System
System backups are only available to Skpr platform operators and follow this retention strategy.
Type | Retention | Schedule |
---|---|---|
Daily | 7 days | Nightly |
Weekly | 5 weeks | First day of the week |
Monthly | 7 months | First day of the month |
Backups are stored in AWS Backup, a fully managed service for backing up and restoring AWS managed services.
AWS Backup targets the following managed services:
- Amazon Relational Database Service (RDS)
- Amazon EFS
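The retention strategy above can be sketched in a few lines of Python. This is only an illustration of the table, not how AWS Backup is actually configured: the function names are hypothetical, and "first day of the week" is assumed to mean Monday.

```python
from datetime import date, timedelta


def backups_due(day: date) -> list[str]:
    """Return the system backup types scheduled for the given day."""
    due = ["daily"]              # a nightly backup runs every day
    if day.weekday() == 0:       # assumption: the week starts on Monday
        due.append("weekly")
    if day.day == 1:             # first day of the month
        due.append("monthly")
    return due


def add_months(day: date, months: int) -> date:
    """Shift a first-of-month date forward by whole calendar months."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, day.day)


def expires_on(kind: str, taken: date) -> date:
    """Return the date on which a backup of this type leaves retention."""
    if kind == "daily":
        return taken + timedelta(days=7)
    if kind == "weekly":
        return taken + timedelta(weeks=5)
    if kind == "monthly":
        return add_months(taken, 7)
    raise ValueError(f"unknown backup type: {kind}")
```

For example, a day that is both a Monday and the first of the month produces all three backup types, each with its own expiry date.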
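Workflow
These are backups that development teams create and restore themselves using the Skpr command-line interface.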
Creating a backup
skpr backup create prod
Restoring from a backup
skpr restore create stg BACKUP_ID
Scenarios
Application Level Recovery
Scenario
Data has been lost due to a code or user error in the application.
Solution
Restore the content of the site using the Skpr command-line interface.
Recovery Time Objective (RTO)
30 minutes
Time will vary depending on the size of the database and files.
Recovery Point Objective (RPO)
24 hours
Steps
- The development team recovers their data from the most recent backup using the skpr restore command.
- The Skpr platform team assists with the restoration if required.
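To make the recovery point objective concrete, the sketch below computes how much data would be lost if a site were restored from its most recent backup, and checks that figure against the 24 hour RPO stated above. The function names and timestamps are illustrative assumptions, not part of the Skpr tooling.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=24)  # recovery point objective stated for this scenario


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Return how much data (measured in time) a restore would lose."""
    if incident < last_backup:
        raise ValueError("incident precedes the backup")
    return incident - last_backup


def within_rpo(last_backup: datetime, incident: datetime,
               rpo: timedelta = RPO) -> bool:
    """Return True if restoring from last_backup satisfies the RPO."""
    return data_loss_window(last_backup, incident) <= rpo


# Hypothetical timeline: nightly backup at 02:00, data lost at 16:30.
backup_at = datetime(2024, 5, 1, 2, 0)
incident_at = datetime(2024, 5, 1, 16, 30)
print(data_loss_window(backup_at, incident_at))
print(within_rpo(backup_at, incident_at))
```

With nightly backups the window never exceeds roughly one day, which is where the 24 hour figure comes from.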
Platform Level Recovery
Scenario
Configuration which relates to the operation of the Skpr hosting platform has been deleted.
Solution
Reapply platform configuration.
Recovery Time Objective (RTO)
30 minutes
Recovery Point Objective (RPO)
24 hours
Steps
- The Skpr platform team will reapply the configuration using infrastructure-as-code manifests.
Infrastructure Provider Level Recovery
Scenario
The underlying infrastructure provider (AWS) has had a catastrophic failure which results in downtime for the Skpr hosting platform. This would typically occur only if the provider suffers a failure across all three availability zones in a region.
Solution
Migrate the Skpr platform to a new region that is not experiencing failures.
Recovery Time Objective (RTO)
3 hours
Duration depends on multiple factors, e.g. the ability to update DNS records.
Recovery Point Objective (RPO)
24 hours
Steps
- The Skpr platform team determines the best region for the new platform. Available options include:
  - Singapore - This is the closest available data centre outside of Australia.
  - Melbourne - Coming soon
- The Skpr platform team will provision a new Skpr platform on the existing AWS account within a new region.
- The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
- The Skpr platform team will move application data (database, files etc.) to the new platform.
- Clients will review the recovered sites for defects prior to switching traffic.
- Clients will update their DNS records to direct traffic to the new site.
Considerations
- Data Sovereignty - The Skpr platform team will liaise with clients to determine whether they have organisational data sovereignty rules that prohibit the use of overseas data centres.
- DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
- Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
- Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use to implement a defence-in-depth strategy when configuring external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.
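The static IP consideration above amounts to updating an allowlist on each external service. A minimal sketch of such a check follows, assuming the external service can be configured with a list of CIDR ranges. The addresses are hypothetical placeholders taken from the RFC 5737 documentation ranges, not real Skpr IPs.

```python
from ipaddress import ip_address, ip_network

# Hypothetical static IPs for the old and new platforms (documentation ranges).
OLD_PLATFORM_IPS = ["203.0.113.10/32", "203.0.113.11/32"]
NEW_PLATFORM_IPS = ["198.51.100.20/32", "198.51.100.21/32"]


def is_allowed(source_ip: str, allowlist: list[str]) -> bool:
    """Return True if source_ip falls inside any CIDR range in allowlist."""
    addr = ip_address(source_ip)
    return any(addr in ip_network(cidr) for cidr in allowlist)


# During the migration both sets can be accepted; afterwards the old
# platform's IPs should be removed from the allowlist.
migration_allowlist = OLD_PLATFORM_IPS + NEW_PLATFORM_IPS
print(is_allowed("198.51.100.20", migration_allowlist))
print(is_allowed("198.51.100.20", OLD_PLATFORM_IPS))
```

The second call shows the failure mode to expect if an integration is not updated: requests from the new platform are rejected even though authentication would otherwise succeed.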
Alternative Solution
- Static Snapshot - Generate a static version of the site which requires minimal infrastructure to operate. The Skpr platform team works with the development team to generate the site and updates the existing CDN to direct traffic to the temporary solution.
Security Level Recovery
Scenario
The Skpr platform team determines the security of the platform has been compromised.
The following plan does not cover tasks that will be completed as part of a security-related investigation.
Solution
Migrate the Skpr platform to a new AWS account.
Recovery Time Objective (RTO)
3 hours
Duration depends on multiple factors, e.g. the ability to update DNS records.
Recovery Point Objective (RPO)
24 hours
Steps
- The Skpr platform team will provision a new Skpr platform in a new AWS account.
- The Skpr platform team will work with customers to help migrate their applications to the newly provisioned Skpr platform.
- The Skpr platform team will move application data (database, files etc.) to the new platform.
- Clients will review the recovered sites for defects prior to switching traffic.
- Clients will update their DNS records to direct traffic to the new site.
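Once DNS records have been updated, it can help to confirm that a hostname now resolves only to the new platform. The sketch below is one way to do that with the Python standard library. The hostname and IP addresses are hypothetical, and the resolver is passed in as a parameter so the check can be exercised without network access.

```python
import socket
from typing import Callable, Iterable


def resolve_ipv4(hostname: str) -> set[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def points_at_new_platform(
    hostname: str,
    new_ips: Iterable[str],
    resolver: Callable[[str], set[str]] = resolve_ipv4,
) -> bool:
    """Return True if every resolved address belongs to the new platform."""
    resolved = resolver(hostname)
    return bool(resolved) and resolved <= set(new_ips)


# Hypothetical addresses for the new platform (RFC 5737 documentation range).
NEW_IPS = {"198.51.100.20", "198.51.100.21"}


def fake_resolver(hostname: str) -> set[str]:
    """Stand-in for a real lookup; pretends the cutover has completed."""
    return {"198.51.100.20"}


print(points_at_new_platform("www.example.com", NEW_IPS, fake_resolver))
```

In real use the default resolver would be kept, bearing in mind that cached records can continue to return the old addresses until their TTL expires.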
Considerations
- DNS - Clients will need to update their DNS records to the new Skpr platform as part of the migration.
- Integrations - Development teams will need to test and ensure that the existing integrations with third-party services are still functioning post-migration.
- Static IPs - The Skpr hosting platform provides a set of static IPs that development teams can use to implement a defence-in-depth strategy when configuring external services, e.g. limiting requests to a set of IPs on top of an existing authentication strategy. Clients will need to update these configurations with the new set of IPs provisioned on the new platform.