The Challenges with Traditional Disaster Recovery Solutions
As enterprises increasingly use Kubernetes to run stateful applications with persistent volumes, building an enterprise-grade Kubernetes platform entails more than simply deploying your applications. You need to protect them too. As application uptime becomes a baseline expectation, most enterprises simply can’t risk the impact of a service outage or failure. Many organizations try to retrofit traditional, server-based disaster recovery solutions, but these lack the granularity needed to orchestrate DR protocols for containers. The result is a manual approach to DR, when what you need is automation.
Container-granular and application-aware
Kubernetes applications are container based, not virtual machine or server based. To effectively run Kubernetes backup and disaster recovery, replication needs to happen at the container level.
Automated disaster recovery
Relying on a manual process for Kubernetes application recovery is unreliable—it takes time you can’t risk and leaves room for error. Protection needs to just happen. You need to know that your recovery copies are meeting SLAs without intervention, as opposed to implementing complex scripts once a server has already gone down.
Support for all Kubernetes environments
You need support to recover applications in any environment – whether that’s on public cloud, private cloud, or on-prem.
Support for all your recovery SLAs
You need to be able to meet the recovery requirements for all your Kubernetes clusters, whether they are mission-critical or tier 2 or 3. Synchronous and asynchronous DR are both requirements for the enterprise. Extended downtime is simply not an option.
Zero Recovery Point Objective (Zero RPO)
Recovery Point Objective (RPO) defines the amount of data loss you can tolerate in the event of a disaster. For mission-critical applications, the RPO is often zero, meaning no loss can be tolerated.
Low Recovery Time Objective (Low RTO)
Recovery Time Objective (RTO) defines how soon after a disaster the application can be recovered. For many mission-critical applications, this time frame is less than a few minutes.
Kubernetes Disaster Recovery
A Kubernetes disaster recovery plan details the process of recovering a Kubernetes cluster and its applications in the event of a failure or service disruption, ensuring limited downtime and business continuity. Understanding your application needs is a critical component of defining disaster recovery strategies for Kubernetes, especially since a container orchestration platform like Kubernetes has a few quirks that require a different approach.
Kubernetes disaster recovery is different from traditional disaster recovery
The problem with traditional DR solutions is that they’re ill-suited to the way containers are built. A virtual machine, by comparison, is simple: an application lives on a single VM or group of VMs, and replicating the VM is usually sufficient when it’s time to recover.
Kubernetes infrastructure, however, is purpose-built to be dynamic and distributed. A single VM may host components of multiple applications while containing all the components of none. Using a VM-based approach for Kubernetes clusters therefore prolongs the recovery process. A Kubernetes application also includes underlying resources like metadata and persistent data stored externally, and a VM solution does not understand how to restore an application to its desired state, resulting in extended downtime or data loss. Building a DR strategy for Kubernetes applications around VMs means you cannot target specific applications or containers for recovery.
It is therefore best to implement a built-for-Kubernetes disaster recovery solution. Such solutions are designed to inherently recognize and restore all the data and components in a Kubernetes cluster, along with their relationships to each other.
Key Elements of a Kubernetes disaster recovery plan
Creating an effective disaster recovery plan requires a thoughtful, multi-pronged approach. No single approach will sufficiently cover all your applications, so it is important to take each of the following key elements into consideration when formulating your disaster recovery plan.
High availability
High availability refers to the ability of a system to operate continuously, nearly 100% of the time, with minimal downtime. Ensuring high availability is the foundation of any good data protection and disaster recovery plan in case of infrastructure failures within a single data center or availability zone. There are a few critical considerations to achieve high availability.
First, you need to plan for redundancy. Within a single Kubernetes cluster, for example, it is critical to run multiple replicas of your application in case any single pod fails. When a failure occurs, the Kubernetes scheduler automatically reschedules the failed pod onto a healthy node, keeping the application available to the end user. When the pods share the same underlying storage, the data itself is not at risk and remains consistent.
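As a minimal sketch, redundancy within a cluster is usually expressed as a replica count on a Deployment. The name and image below are illustrative, not taken from any specific environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp              # illustrative name
spec:
  replicas: 3               # run three copies so one pod failure doesn't cause downtime
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: nginx:1.25   # placeholder image
```

If a pod or its node fails, the Deployment controller notices that the actual replica count has dropped below three and the scheduler places a replacement pod on a healthy node.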
Load balancing also ensures that any incoming traffic is evenly distributed across healthy pods, so no single pod is bearing the brunt of the traffic. This in turn enhances the availability and performance of the end application.
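In Kubernetes, that load balancing is typically provided by a Service, which spreads incoming traffic across all healthy pods matching its selector. A minimal sketch, assuming illustrative pods labeled `app: webapp`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp              # illustrative name
spec:
  selector:
    app: webapp             # traffic is spread across all healthy pods with this label
  ports:
  - port: 80                # port clients connect to
    targetPort: 8080        # port the application container listens on
```

A Service only routes to pods that pass their readiness checks, so unhealthy pods are automatically taken out of rotation.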
Replication is a necessary component of high availability. Without it, teams would have to rely on backup and restore, which is not fast enough to ensure highly available applications.
Backup and restore
Regular backups also play a key role in a Kubernetes disaster recovery plan, since backups protect data from accidental deletion or ransomware. Backups come into play when the underlying data is compromised and manual intervention is required. Whatever type of disaster recovery you need, backing up Kubernetes clusters can be challenging, as the infrastructure is complex, with multiple components to consider.
Kubernetes backups need to capture the entirety of the Kubernetes application, including cluster state, application configurations, and data stored externally on Persistent Volumes (PVs). This is the only way to ensure application-consistent backups with no corruption and limited data loss. When it’s time to restore, a built-for-Kubernetes solution can read and restore the application metadata to return the application to its desired state. The backups should also be container-granular, meaning you can target individual containers to restore without restoring an entire VM. A container-granular approach makes the recovery process quicker and more reliable.
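As one illustration of capturing Kubernetes objects and PV data together, the open-source tool Velero expresses a backup as a declarative resource. The application namespace below is an assumption:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: payments-backup     # illustrative name
  namespace: velero
spec:
  includedNamespaces:
  - payments                # illustrative application namespace to protect
  snapshotVolumes: true     # snapshot Persistent Volumes, not just object metadata
  ttl: 720h                 # retain the backup for 30 days
```

Capturing the namespace's objects and its volume snapshots in one unit is what makes the restored application consistent, rather than a pile of disconnected pieces.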
Many Kubernetes data protection and disaster recovery service providers claim to offer disaster recovery in the form of backup and recovery. However, backup and restore is not considered true Kubernetes disaster recovery, because disaster recovery requires replication of application data to a secondary cluster.
Learn more on the Kubernetes Backup page.
Define SLAs for different application tiers
A disaster recovery plan is not a one-size-fits-all approach. Not all applications have the same target RPO and RTO requirements. It is important to define the SLAs for the different tiers of applications you have. The most important question to consider is this: what tolerance do you have for data loss and recovery time?
Mission-critical applications are often the most inflexible as far as tolerance goes. For example, in the financial services industry, customers expect banks and credit card companies to provide real-time information about their finances or credit card transactions. If there is missing data or service due to an outage, it could have catastrophic results for the company, resulting in loss of customer loyalty and brand equity. Many enterprise companies cannot tolerate any data loss for their mission-critical applications, requiring zero RPO and very low RTO to ensure there is limited application downtime.
Tier 2 and 3 applications tend to be more tolerant to data loss and recovery time, allowing for several hours or even up to a day of RPO and RTO. If your loss tolerance can be covered by your scheduled backup policy, those applications can be sufficiently protected without disaster recovery.
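A scheduled backup policy can be expressed declaratively. This sketch uses Velero's Schedule resource with illustrative names; a nightly run gives the protected namespace an effective RPO of up to 24 hours:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: tier3-nightly       # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"     # cron syntax: nightly at 02:00
  template:                 # the backup created on each run
    includedNamespaces:
    - reporting             # illustrative tier-3 application namespace
    snapshotVolumes: true
```

Tightening the cron expression shrinks the worst-case data loss; the schedule should be derived from the tier's stated RPO, not the other way around.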
Understand the difference between synchronous and asynchronous disaster recovery
For those mission-critical Kubernetes applications that require zero RPO, synchronous disaster recovery is essential: every write to the primary site is replicated to the secondary site before it is acknowledged. Disaster recovery between Kubernetes clusters replicates not only the application data between the two clusters, but also application configurations and the cluster state, so at any given time there is no difference between your primary copy and your recovery copy.
In the event the primary copy is compromised, a fast failover ensures that the identical recovery copy takes over as quickly as possible by redirecting incoming traffic, provisioning resources, and starting the applications in the replicated copy. Very few built-for-Kubernetes disaster recovery solutions can offer zero-RPO disaster recovery, as synchronous disaster recovery requires replication at the storage level.
For tier 2 or tier 3 applications, asynchronous disaster recovery is usually sufficient. Asynchronous disaster recovery also requires application data to be replicated between two Kubernetes clusters. However, the replication does not happen at the same time that changes are made to the primary cluster. The replication to the secondary cluster usually happens on a scheduled basis, based on loss tolerance.
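As a sketch of what scheduled asynchronous replication can look like, the following uses the MigrationSchedule resource from Stork, the scheduler used alongside Portworx. Field names may vary by version, and the namespace, cluster pair, and policy names are assumptions:

```yaml
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: payments-dr         # illustrative name
  namespace: payments       # illustrative application namespace
spec:
  template:
    spec:
      clusterPair: dr-cluster     # assumed pairing with the secondary cluster
      namespaces:
      - payments
      includeResources: true      # replicate Kubernetes objects, not just volumes
      includeVolumes: true        # replicate Persistent Volume data
      startApplications: false    # keep the standby copy dormant until failover
  schedulePolicyName: hourly-policy   # assumed policy matching the tier's loss tolerance
```

The schedule interval is the knob that sets the tier's RPO: replicating hourly means up to an hour of changes can be lost on failover.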
Some organizations may choose to protect their tier 2 or 3 applications using simple backup and restore, restoring the copy from the most recent backup. The loss tolerance here matches your backup schedule, based on how frequently you back up your applications.
However, with asynchronous disaster recovery or backup and restore, there will be a delta between the primary copy and the restore copy, so enterprises must be cognizant of the resulting data loss.
Key Factors of an Effective Kubernetes Disaster Recovery Strategy
Low latency requirements
For mission-critical applications that require zero RPO and near-zero RTO, it’s important to consider latency: the speed at which data can be replicated from the primary copy to the secondary copy. Latency is affected by a number of factors, including network connectivity, bandwidth utilization, and geographical distance.
To ensure the lowest possible latency, you will need the primary and secondary copies to be located in data centers within the same metro region, as latency increases as the distance between sites grows.
Repeatability
The essential characteristic of a DR strategy is repeatability. You have to rely on it in case of failure, and it should work flawlessly every time. In the event of a disaster, many complex issues need to be addressed simultaneously. You want a repeatable process using a simple, easy-to-use solution that doesn’t require complex procedures during a high-stress event.
Using a solution with automation makes a big difference here. Leveraging automation for your disaster recovery can help minimize errors and streamline the process. Monitoring tools can help detect issues early by giving visibility into the health and performance of your Kubernetes clusters.
But achieving complete repeatability takes more than just having the right tools and solutions. Everyone in your organization should also be trained to respond immediately during a disaster. A planned response will minimize downtime and data loss in the event of any failure.
Ensuring business continuity
Finally, a disaster recovery plan is a critical component of ensuring business continuity. At its core, a business continuity plan answers the critical question of how to continue critical operations in the event of a disaster.
This comprehensive plan includes not only the technical components of disaster recovery, like defining the SLA requirements and loss tolerance of critical and non-critical applications and setting up fast failover processes to start up any failed applications in healthy Kubernetes clusters, but it also includes key business operations, like maintaining clear and timely communication, training and testing relevant stakeholders, and continuously improving application resilience.
The goal of business continuity should be to minimize the impact of a disruption and continue essential operations as quickly as possible.
The Portworx Solution
Automate protection of your containerized applications with Kubernetes-optimized, cloud-native disaster recovery.
Recover Entire Apps
Portworx doesn’t just protect data. We also protect your application configuration and Kubernetes objects, so that recovering your applications is as easy as redeploying your pods.
For data centers in a metro area, a single Portworx cluster can span two distinct Kubernetes clusters, enabling Zero RPO failover for mission-critical apps.
Because Portworx protects your application configuration and Kubernetes objects in addition to your data, we ensure extremely low RTO.
When replicating data between clusters in the same metro region, Portworx provides latency of less than 10 milliseconds.
All Stateful Apps
You don’t have to be an expert in each data service, because our app-specific capabilities automate DR for any data service.
Run On All Infrastructures
Portworx aggregates your underlying storage in the cloud (AWS EBS, Google PD, etc.) or on-prem (Pure Storage arrays, bare metal, NetApp, EMC, vSAN, etc.) and turns it into a container-native storage fabric.