Business continuity and disaster recovery (BCDR) is the practice of restoring data and operations after a disaster. Disasters take many forms, including human error, service outages, natural disasters, and ransomware. Without a BCDR and data protection plan in place, these events can lead to unrecoverable data loss or extended service outages.
Why is Business Continuity and Disaster Recovery (BCDR) Important?
Organizations are adopting containers at a rapid pace, with Gartner Inc. stating that “by 2029, more than 95% of global organizations will run containerized applications in production — a significant increase from fewer than 50% in 2023.”
As container adoption grows, so do the use cases for containerized applications, which have expanded from stateless to stateful workloads. These stateful workloads, in turn, are moving from small Kubernetes test environments to critical workloads that achieve key business objectives, like powering video streaming for customers or driving faster credit card transactions.
Customers and users expect always-on availability with little to no data loss from these critical applications, especially the ones that power their daily lives. A company that suffers extended downtime or data loss can face devastating damage to revenue and reputation, and even legal consequences.
As a result, having a business continuity and disaster recovery plan is crucial.
Key Concepts of BCDR
Business Continuity
Business continuity is the plan a business puts in place to keep operations running during an incident while minimizing disruption. It has a broader focus than disaster recovery, spanning business functions to keep the organization running as smoothly as possible. Ultimately, business continuity aims to keep the business functioning after an incident.
Disaster Recovery
Disaster recovery is an IT plan for restoring processes, infrastructure, and data after an event. A disaster recovery plan should outline the steps an engineering team must take to restore applications based on predefined service level agreements (SLAs), which define how much data loss and downtime are tolerable for each type of application affected by a disaster.
Both a business continuity plan and a disaster recovery plan should be in place to keep operations running during a disaster and to fully recover from one.
Identifying Risks
Technical Modes of Failure
Although data has become a mainstream part of everyday life, the infrastructure that stores and serves it remains fragile. Disaster can strike in many ways, but a good disaster recovery plan and solution can mitigate the risks.
Hardware
Hardware failures are a common cause of downtime. Whether it's a server malfunction, storage device failure, or network equipment breakdown, hardware issues can disrupt operations and lead to significant data loss. To avoid major disruptions, organizations should invest in redundancy, regular maintenance, and observability through monitoring and alerting.
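As a hedged sketch of what that alerting can look like in a Kubernetes environment, the rule below assumes the Prometheus Operator and node-exporter are already deployed; the names, labels, and threshold are all illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-health-alerts
  namespace: monitoring
spec:
  groups:
  - name: hardware
    rules:
    - alert: NodeDown
      # Fires when a node's exporter has been unreachable for 5 minutes,
      # a common early signal of hardware or network failure.
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} has been unreachable for 5 minutes"
```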
Software Vendor Solutions
Software failures, whether due to bugs, compatibility issues, or vendor outages, can also cripple your operations. Even the most reputable software vendors may experience an outage or significant downtime, which is why it is especially important for IT teams to de-risk their stack by investing in multiple clouds, databases, and software vendors.
Kubernetes Components
Failures can also occur within the components of a Kubernetes cluster itself. Nodes or even entire clusters may fail due to network issues, resource exhaustion, misconfiguration, and more, so there must be a plan for failing over to healthy nodes or clusters. Clusters can also be made more stable with scaling policies that adjust resources on demand, so capacity is always available when needed.
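As a minimal sketch of one such policy, the standard HorizontalPodAutoscaler below (the Deployment name web-frontend is illustrative) keeps at least three replicas running and adds pods when average CPU utilization exceeds 70 percent:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3     # never drop below three replicas, even at low load
  maxReplicas: 10    # cap growth so a runaway workload cannot exhaust the cluster
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```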
Human Error
Human error is another unfortunately common risk. In a recent incident at a major telecommunications provider, an engineer accidentally ran a delete command that removed every volume on a cluster. Several applications, including external-facing applications that directly impacted customers, went offline. Without a business continuity or disaster recovery plan in place, those applications would have remained offline, interrupting operations for customers.
Cybersecurity Incidents
According to Sophos's State of Ransomware 2024 report, 59% of organizations experienced a ransomware attack within the past year. These attacks happen for a number of reasons: phishing attempts, exploited vulnerabilities, malicious emails, compromised credentials, and more. The cost of a data breach is also significant, with IBM reporting that a single data breach costs an average of $4.88 million in lost revenue, downtime, and breach response.
Any organization can be the target of a ransomware attack, so every organization must safeguard its critical data and have measures in place to recover any data that is lost or compromised.
Building an Effective BCDR Strategy
Developing Recovery Objectives
Not all applications are built the same. Some are mission-critical and cannot tolerate any downtime or data loss; others can be down for hours or even days without impacting the business. An important part of any disaster recovery plan is defining downtime and data loss tolerances for different tiers of applications and protecting each tier accordingly, to keep the applications and your business running smoothly.
RPO / RTO
The two key concepts for downtime and data loss tolerance are Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
RPO measures how much data loss you can tolerate for a specific application in the event of a disaster, usually expressed as a window of time. For mission-critical applications, the RPO is zero data loss. For example, banks and credit card companies handle sensitive financial information on behalf of their customers; losing any financial data, like deposits or credit card transactions, would have a disastrous effect on both the customer base and the company itself.
RTO measures how much downtime you can tolerate for a specific application. For mission-critical applications, RTO can in theory approach zero, but in practice it depends on how quickly a team can re-route traffic from a failed node to its replica. With a low enough RTO, downtime is almost undetectable to the end user.
Components of a BCDR Plan
A disaster recovery plan needs to take into account these RPO and RTO tolerances for an organization’s Kubernetes applications. Often, platform teams choose to split applications into different tiers that outline the processes for disaster recovery.
Tier 1 applications, or mission-critical applications, are those that need zero RPO and near-zero RTO. Achieving this requires synchronous replication at the storage level: every write to the primary site is also committed to the secondary site before it is acknowledged, so there is no delta between the two. If anything happens to the primary site, infrastructure teams need only fail over to the secondary site to prevent any data loss.
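At the storage layer, this is often expressed as a replication factor. The sketch below uses a Portworx StorageClass as one example and assumes the Portworx CSI provisioner is installed; a true metro DR topology additionally spreads replicas across cluster domains, and the exact setup varies by environment:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-mission-critical
provisioner: pxd.portworx.com   # Portworx CSI driver (assumed to be installed)
parameters:
  repl: "3"                     # keep three synchronous replicas of every volume
allowVolumeExpansion: true
```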
Tier 2 applications tend to have more flexible RPO and RTO requirements and can be recovered using asynchronous replication. With asynchronous replication, data is still replicated from the primary site to the secondary site, but not in real time as changes occur on the primary. Instead, replication usually runs on a schedule, as determined by the disaster recovery plan.
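One way this can look in practice, sketched here with the CRDs of Stork (the open-source scheduler extension that ships with Portworx), is a schedule policy paired with a migration schedule. The names (payments, dr-site) are illustrative, and the ClusterPair to the secondary cluster must already be configured:

```yaml
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: every-15-min
policy:
  interval:
    intervalMinutes: 15          # asynchronous replication runs every 15 minutes
---
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: payments-dr
  namespace: payments
spec:
  schedulePolicyName: every-15-min
  template:
    spec:
      clusterPair: dr-site       # pairing to the secondary cluster, created beforehand
      namespaces:
      - payments
      includeResources: true     # copy Kubernetes objects along with the data
      startApplications: false   # keep the apps dormant on the DR site until failover
```

With startApplications set to false, the secondary site holds a dormant copy that can be activated at failover time, which keeps the worst-case data loss bounded by the 15-minute interval.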
Tier 3 applications have the most flexible RPO and RTO requirements. They may not require replication at all and can instead rely on backups or snapshots to recover an application. Because there is no replica on a second site, recovering from a backup copy requires more steps, but Tier 3 applications can tolerate longer outages and downtime, so recovering from a backup copy is usually sufficient to meet their SLAs.
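As an illustrative sketch that works with any CSI driver supporting snapshots (all names below are hypothetical), a point-in-time copy is taken with a VolumeSnapshot and restored by creating a new PVC from it:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: reporting-db-snap
  namespace: reporting
spec:
  volumeSnapshotClassName: csi-snapclass   # must match an installed snapshot class
  source:
    persistentVolumeClaimName: reporting-db-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reporting-db-data-restored
  namespace: reporting
spec:
  dataSource:                              # restore by cloning the snapshot
    name: reporting-db-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```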
Plan Implementation and Maintenance
Testing a disaster recovery plan may not be the most thrilling task, but it is an often-overlooked activity that helps ensure your applications suffer as little downtime and data loss as possible. When disaster strikes, engineering teams don't have the luxury of time to get operations back up and running. Regular testing, of both backups and disaster recovery, is necessary to ensure applications can be fully restored quickly.
Engineering teams should be trained on DR protocols so they can quickly restore applications. One way to ensure proper training is to run scheduled tests that simulate disaster scenarios, like regional or data center outages, and verify that RPO and RTO goals are consistently met.
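One way to automate part of such a drill is with a chaos engineering tool. The sketch below assumes Chaos Mesh is installed and targets a hypothetical payments-api application; treat it as a starting point, since field names can vary between versions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: dr-drill-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill        # terminate a pod, then observe how quickly service recovers
  mode: one               # affect a single randomly chosen pod
  selector:
    namespaces:
    - payments
    labelSelectors:
      app: payments-api
```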
Teams creating a DR plan should also consider the container infrastructure itself. Containerized applications are not bound to a virtual machine or server, so replication needs to happen at the container level to ensure the data and its associated metadata, like application configuration and Kubernetes objects, can all be recovered at once. Engineering teams shouldn't have to spend extra time restoring data and the application separately, or rely on documentation to walk them through recovery steps. Any DR solution for containerized applications should be able to recover specific containers or applications for targeted recovery.
Disasters are often high-pressure, chaotic events. Organizations and engineering teams need to be prepared to get operations back up and running as quickly as possible, which means training teams to move fast and using container-aware DR solutions that enable quick recovery.
Building Resilience Outside of Your BCDR Strategy
Other Standards and Best Practices for Data Resilience
Disaster recovery alone cannot ensure data resilience. Organizations should consider other forms of protection, like backup and high availability, to build a holistic approach that lets them recover quickly and with limited downtime.
High Availability
High availability refers to the ability of IT infrastructure to keep operating continuously without manual intervention. Ensuring high availability is one of the key components of resilience against infrastructure failures.
Within a single Kubernetes cluster, for example, organizations should set up redundancy in case any individual pod fails. As soon as a failure occurs, the ReplicaSet controller creates a replacement pod, which the Kubernetes scheduler places on a healthy node, keeping the application highly available to the end user. Because the replacement pod attaches to the same underlying storage, the data itself is not at risk and remains consistent.
Load balancing also ensures that incoming traffic is evenly distributed across healthy pods, so no single pod bears the brunt of the traffic. This in turn improves the availability and performance of the end application.
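A minimal sketch of both ideas together might look like the following, with illustrative names and image: a Deployment that spreads three replicas across separate nodes using pod anti-affinity, and a Service that load-balances traffic across whichever of those pods are healthy:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                    # redundancy: three identical pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:         # keep replicas on different nodes so one
          requiredDuringSchedulingIgnoredDuringExecution:   # node failure cannot take all of them down
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                     # routes only to healthy, ready pods
  ports:
  - port: 80
    targetPort: 80
```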
Backup
Backup and restore, often referred to as data protection, is another key component of data resilience. Infrastructure teams turn to backups when application data is compromised and manual intervention is required. Regular backups can return an application to its last known good state, limiting data loss in the event of accidental deletion, data corruption, or a ransomware attack.
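As one hedged example of scheduling those regular backups, the open-source tool Velero exposes a Schedule resource; this sketch assumes Velero is installed in the velero namespace, and the target namespace and retention window are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-app-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron expression: every day at 02:00
  template:
    includedNamespaces:
    - reporting                  # back up this application's namespace
    ttl: 720h0m0s                # retain each backup for 30 days
```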
It is important not to confuse backup and restore with disaster recovery. Disaster recovery requires replication at the storage level to ensure minimal to no data loss and limited downtime, whereas backups return an application to its most recent good copy.
Portworx Business Continuity and Disaster Recovery Solutions
Portworx is the container data management solution that automates, protects, and unifies modern applications at enterprise scale. Portworx business continuity and disaster recovery solutions provide comprehensive data resilience for Kubernetes applications, with flexible policies for synchronous and asynchronous disaster recovery, high availability, and built-for-Kubernetes backup and restore, regardless of where applications run, on-premises or in the cloud.
Using Portworx, enterprises can achieve zero RPO and low RTO for their mission-critical applications with synchronous disaster recovery between clusters in the same metro region. Portworx also provides asynchronous disaster recovery for applications with an RPO of 15 minutes or more; when disaster strikes, those applications lose no more data than their RPO allows.
Portworx was purpose-built for Kubernetes workloads. It understands all the components of a Kubernetes application, from the underlying data to the application configuration, so recovery is quick and complete. Portworx also provides granular recovery that can target specific applications or namespaces.