August 26, 2019
Understanding Disaster Recovery, RTO and RPO on Kubernetes
Disasters come in many forms. A disaster isn’t always a natural event such as a hurricane or a tornado; it can also be a network outage, datacenter fire, hardware failure, cloud provider outage or even a data breach. Whichever disaster occurs, one factor remains consistent: the need to recover data and application state to another site, which is no small task. Pile on the need to keep operational overhead and monetary fallout to a minimum and your head can start to spin. A proper disaster recovery plan is very detailed, and there are many trade-offs and scenarios to weigh. This blog post focuses on two important disaster recovery concepts, Recovery Time Objective (RTO) and Recovery Point Objective (RPO), and how they relate to disaster recovery on Kubernetes. We’ll also show how you can go about achieving different levels of RTO and RPO with Portworx.
What are RTO and RPO
RTO stands for Recovery Time Objective. It is the maximum amount of time that a service or application can be down or inaccessible following a disaster. This amount of time should fall within the bounds of what the company’s SLAs (Service Level Agreements) and policies consider acceptable. In short, it is the amount of downtime a business can absorb without notable friction. Typically this is tied to some monetary value that is protected by an SLA. As an example, if a system manages health records and doctors access it while taking care of patients, each minute the service is unavailable is valuable time lost for patient care. You could imagine this service would need a low RTO and RPO tied to a fairly aggressive SLA.
RPO stands for Recovery Point Objective. It is the amount of data loss a company can tolerate following a disaster. Tolerance for data loss can be expressed as the amount of time between an outage event and the most recent snapshot or backup of the related data. For example, if you back up or snapshot your data every 12 hours, at noon and midnight, then the worst-case scenario is that you lose nearly 12 hours of data if an outage occurs just before the next backup is due. This may be acceptable for applications that can rebuild data from other sources or are accessed fairly infrequently, but it may be far from acceptable for other types of applications.
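To make the noon-and-midnight example concrete, here is a minimal Python sketch that measures the data loss window for a given outage time and compares it with an RPO. The function names and the dates are illustrative assumptions, not part of any Portworx or Kubernetes tooling.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, outage_time):
    """Return the time between the outage and the most recent backup before it."""
    earlier = [t for t in backup_times if t <= outage_time]
    if not earlier:
        raise ValueError("no backup exists before the outage")
    return outage_time - max(earlier)


def meets_rpo(backup_times, outage_time, rpo):
    """True if the data lost in this outage stays within the RPO."""
    return data_loss_window(backup_times, outage_time) <= rpo


# Backups at midnight and noon, as in the example above (the date is arbitrary).
day = datetime(2019, 8, 26)
backups = [day, day + timedelta(hours=12)]

# An outage one minute before the next midnight backup is close to the worst case.
outage = day + timedelta(hours=23, minutes=59)

loss = data_loss_window(backups, outage)
print(loss)                                             # 11:59:00
print(meets_rpo(backups, outage, timedelta(hours=12)))  # True
print(meets_rpo(backups, outage, timedelta(hours=6)))   # False
```

A 12-hour backup cadence therefore only fits applications whose RPO is 12 hours or more; a tighter RPO needs more frequent backups or replication.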
451 Research recently surveyed enterprise IT leaders and found that RPOs and RTOs decrease as the criticality of an application increases. For example, 48% of mission critical applications require an RTO of less than 1 hour while 57% require an RPO of less than 1 hour.
Later in this post we’ll explore some example scenarios showing how RTO and RPO apply to applications running on Kubernetes.
Applications on Kubernetes
Kubernetes in and of itself has many APIs, services and systems running which may need SLAs and disaster recovery planning. However, we’re going to focus on the applications running on top of Kubernetes for this blog and assume system level operations teams have taken good care of the underlying Kubernetes substrate. You can read more about the Kubernetes architecture in the official Kubernetes documentation if you want to dig in.
Applications running on top of Kubernetes generally run as containers within pods. Each pod can use CPU and memory from the node on which it is running and can optionally connect to storage on the host or via data management systems like Portworx. Pods run as one or more replicas and expose themselves as services or connect to other pods and applications within Kubernetes.
Kubernetes provides some failure resilience within a single cluster such that applications running as pods in Kubernetes are mostly resilient to container, pod, network and node failures. This is because the Kubernetes scheduler will reschedule the pod based on the health of the application or node. Note that if your pod is attached to some type of storage then the data management layer must react and be able to detach, attach and remount the storage on the new node for the pod. Portworx’s Stork (Storage Operator for Kubernetes) enables this to happen seamlessly behind the scenes even when nodes fail. Stork also acts as an aid in enforcing data locality.
Cluster level failure resilience is great, but what happens when an entire site, datacenter or Kubernetes cluster is down?
This is where disaster recovery planning comes in. Good questions to ask include:
- What data do we need to back up?
- How often do we need to back it up?
- How much data loss can we endure without significant customer issues?
- How quickly do we need to guarantee our service comes back up?
- Do we need an active-active DR strategy?
- How can we get applications on our DR site running on Kubernetes as quickly as possible?
Keep in mind that the answers to these questions will change depending on how critical the application is, but they should not be ignored for non-critical applications either. 451 Research states:
“Enterprises and smaller organizations are unwavering in their demands when it comes to recovering workloads and data critical to their business and overall mission. The price of downtime and data loss can be massive and costly outages regularly make headlines. There is higher tolerance for downtime on noncritical apps and data, but for almost half of customers, the RTO (Recovery Time Objective) expectations of less than a day demonstrate that even less-critical apps and data are important and require rapid recovery. Additionally, enterprises will need to move toward data management practices that encompass not just the backup of applications and data but also the ability to restore them quickly after outages as well as make the data available to the business so that insight can be derived from analytic and machine learning endeavors, regardless of whether they exist on-premises or in the cloud.”
Note that there are various other ways in which applications can run, and we will not cover every one. Instead, we will explore an example of an application running on Kubernetes and the disaster recovery techniques and planning that can be used to achieve given levels of RPO and RTO.
How to achieve RPO and RTO times with Kubernetes
Let’s dig into two scenarios.
- “University Schedule”: This app provides course schedules to students so that they can log in and view their schedule in real time after they have registered for classes with the registrar. Students use this service fairly often, but it is not critical for students to succeed, and it can be down for up to 6 hours without much issue. Data is local in the datacenter on campus, so as a backup the university runs a standby disaster recovery site in the cloud and uses an encrypted connection over the Internet to back up its data. We can say that for this application the goal is an RPO of 6 hours and an RTO of < 1 hour.
- “Patient App”: This app is part of a cloud-based healthcare system for doctors’ offices, and doctors use its services in real time when taking care of patients. These services and data cannot be down for more than a few minutes, so it’s critical to have an active site that can be synchronously replicated over a low latency network connection. The primary data center in which the application runs has a sister site in the same city connected via high-speed fiber. We can say the RPO should be 0 for this application with an RTO of < 5 minutes.
For the University Schedule application we defined an RPO of 6 hours and an RTO of < 1 hour. We can achieve this by setting up another Kubernetes cluster as our DR site in the cloud. We can then connect these clusters via a Portworx Enterprise ClusterPair, which will allow us to tie a policy to this DR site. This policy can define the following parameters:
- Interval: the interval in minutes after which the action should be triggered
- Daily: the time at which the action should be triggered every day
- Weekly: the day of the week and the time on that day when the action should be triggered
- Monthly: the date of the month and the time on that date when the action should be triggered
For this use case, specifying an interval of 360 minutes (6 hours) would meet our RPO. We can also set the migration to sync the Kubernetes objects such as deployments, statefulsets, secrets and services that belong to the app.
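As a rough illustration, the Python sketch below checks that a chosen interval fits the RPO and builds an interval-based schedule policy as a plain dictionary that can be printed as JSON. The apiVersion, kind and field names follow the Stork SchedulePolicy resource as we recall it and should be checked against the Portworx documentation for your version; the policy name is hypothetical.

```python
import json


def interval_meets_rpo(interval_minutes, rpo_minutes):
    """A periodic migration loses at most about one interval of data.

    This ignores the time the migration itself takes to complete.
    """
    return interval_minutes <= rpo_minutes


def schedule_policy(name, interval_minutes):
    """Build an interval-based policy manifest (field names are assumptions)."""
    return {
        "apiVersion": "stork.libopenstorage.org/v1alpha1",
        "kind": "SchedulePolicy",
        "metadata": {"name": name},
        "policy": {"interval": {"intervalMinutes": interval_minutes}},
    }


RPO_MINUTES = 6 * 60  # the University Schedule RPO of 6 hours

policy = schedule_policy("university-schedule-6h", 360)  # hypothetical name
fits = interval_meets_rpo(
    policy["policy"]["interval"]["intervalMinutes"], RPO_MINUTES
)

print(fits)  # True
print(json.dumps(policy, indent=2))
```

Because the real worst case is the interval plus the time a migration takes to finish, a shorter interval leaves some headroom under the RPO.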
This makes both the data and application available in our DR site. Note that the most recent backup may be up to 6 hours old. When the time comes, or a disaster occurs, we can simply start the applications on the DR site, which will use a volume based on the most recent backup. This can be done by using kubectl, and once complete, network routes can be pointed to the DR site to make the application available. The kubectl step and route change should easily take place within the 1 hour RTO goal we had, so using these tools we can achieve our RPO and RTO goals. In the future we could always adjust our RPO as well by changing the times and intervals at which our policy runs backups.
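One way to sanity-check an RTO goal is to add up the expected duration of each failover step and compare the total with the objective. The sketch below does that in Python; the step names and durations are hypothetical placeholders and should be replaced with timings measured in your own failover drills.

```python
from datetime import timedelta


def within_rto(step_durations, rto):
    """Sum the estimated failover steps and compare the total with the RTO."""
    total = sum(step_durations.values(), timedelta())
    return total, total <= rto


# Hypothetical estimates for a standby-site failover.
steps = {
    "detect the outage and decide to fail over": timedelta(minutes=15),
    "start the applications on the DR cluster with kubectl": timedelta(minutes=10),
    "point network routes at the DR site": timedelta(minutes=10),
}

total, ok = within_rto(steps, timedelta(hours=1))
print(total, ok)  # 0:35:00 True
```

Detection and decision time count against the RTO as well, which is why they appear as the first step here.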
For the Patient App we defined an RPO of 0 with an RTO of < 5 minutes. This means we need our data available in both sites at all times. To achieve this we want to set up synchronous replication so that when data is written to our primary location, a replica of the data is also placed in the DR site. This makes sure all data is always available in both sites. From there we can set up a schedule policy, much like we did for the University Schedule app, that syncs the Kubernetes application objects, but in this case we’re going to sync them every minute so that we are well within our < 5 minute RTO and can track any recent changes to the YAML objects.
In this case, when a failover occurs, the witness will mark the primary domain as down, the DR site will become the only active site, and the applications can be turned on immediately. We already have a replica of our data present, so no new data movement needs to occur, making the application available as fast as Kubernetes can schedule it. This should be within the 5 minute RTO goal we had set.
The other thing to take into account is automation around failing over any DNS routes and records so that DNS points at the healthy endpoint. AWS Route53 with health checks is an example of such a service, but other services can be used to achieve this as well.
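The decision a health-checked DNS failover makes can be sketched in a few lines of Python: walk the endpoints in priority order and route traffic to the first one that passes its health check. This is a generic illustration of the logic, not the API of Route53 or any other DNS service, and the hostnames are hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Priority order: the primary site first, then the DR site (hypothetical names).
endpoints = ["primary.example.com", "dr.example.com"]

# Stand-in for real health checks: the primary site is down.
health = {"primary.example.com": False, "dr.example.com": True}

target = pick_endpoint(endpoints, lambda endpoint: health[endpoint])
print(target)  # dr.example.com
```

In practice the health check would be an HTTP or TCP probe run on a schedule, and the DNS record TTL adds to the time clients take to reach the DR site, so it belongs in the RTO budget too.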
If you learn better by seeing things work live, we have some video examples close to what we have discussed here. You can find them here:
Keep in mind that RTO and RPO are related and have similarities: both are measures of time. However, RPO has one big difference: it is also a measure of the quantity of data that is lost. Customers losing access to an application for an hour is one thing; potentially losing hundreds or even thousands of a customer’s transactions, which equates to losing money, is another. It’s important to qualify the level of RTO and RPO needed by carefully interrogating the disaster recovery plan. Consider the tradeoffs, SLAs and even cost. The cost of running active-active versus standby or cold backups varies with the level of infrastructure needed, but for some applications active-active is a must.
With Portworx and Kubernetes you can achieve varying levels of RPO and RTO for both critical and non-critical applications alike.