October 5, 2021
How to configure a multi-cloud application portability solution with Portworx – Part I
Containers and Kubernetes are getting a lot of airtime right now, with organizations increasingly looking at how these solutions fit into their next-generation, cloud-native architectures. So during a recent COVID lockdown, I was looking for a project to get stuck into, and Kubernetes seemed like a great place to start. The container concept wasn’t new to me: I had previously worked with Linux VServers and Solaris Zones in the 2000s, and I have been researching and running Kubernetes for a few years now. But I had a few burning questions I wanted to answer:
- Kubernetes workloads are supposed to be portable, but could I truly make an application portable across two platforms—like two public clouds or one private cloud and one public cloud?
- If Kubernetes is now becoming a feasible candidate for stateful applications, how do I provide disaster recovery for those applications?
- What tools would I use to provide this application portability and disaster recovery?
- How would I orchestrate the failover, especially since these are entirely separate environments?
This was my project to get me through those long weeks of lockdown. What follows is what I discovered over the course of this project.
The first step I needed to take was to define an application. I wanted the application to be as widely relevant as possible, and WordPress seemed the best place to start. WordPress is a free and open-source content management system written in PHP and paired with a MySQL or MariaDB database. According to their website, WordPress runs 42% of websites on the internet, and MySQL is also widely deployed inside organizations today, so it seemed like a widely relevant application stack to work with.
Since I was working in a limited timeframe and within the multi-cloud context, I wanted to work with a common Kubernetes platform across the entire multi-cloud environment. While this wasn’t strictly necessary, my focus was to work with a proven platform that I could deploy fast and with no inconsistencies, and I found that in Red Hat OpenShift. Red Hat OpenShift is the leading enterprise Kubernetes platform built for an open hybrid cloud strategy—it provides very easy user-driven installation capability for a number of platforms, including bare metal, VMware, AWS, Azure, and GCP.
When you are architecting a disaster recovery strategy for Kubernetes, the overwhelming recommendation is to deploy separate Kubernetes clusters in each site and replicate the objects and the data between the clusters. Based on that, I chose to deploy Red Hat OpenShift on Amazon Web Services and Microsoft Azure public clouds, as shown in the diagram below.
The Data Plane
To orchestrate a failover in a multi-cloud environment, I would need to replicate my container configuration and data between the two environments, so I would need a common data plane. I could have looked to application-level replication (i.e., MySQL replication) to achieve this, but it wouldn’t cover the other critical stateful parts of my application—so my recovery process was going to be long and laborious. I wanted to find a solution where I could replicate everything associated with an application (container config and secrets, SQL, file systems, message queues, in-memory caching instances, machine learning platform, monitoring tools, backups, etc.) with one set of schedules and one management and monitoring pane. To achieve this, I looked to Portworx, an industry-leading Kubernetes data platform that Pure Storage acquired back in late 2020.
Portworx provides a software-defined container storage platform that deploys within the Kubernetes cluster itself, and it delivers high availability, security, quality of service, and backup and disaster recovery for workloads within the cluster, whether located on-prem or in the cloud. Portworx is supported on a wide range of Kubernetes distributions, including Kubernetes.io, Red Hat OpenShift, AWS EKS, Azure AKS, Google GKE, VMware Tanzu, Rancher, and more. Portworx has a number of excellent features that help customers run simpler, faster, and more scalable and resilient container environments, but for this project, I focused on just two Portworx solutions: PX-DR and STORK.
- PX-DR provides replication of Kubernetes persistent data volumes at container granularity, with single-command protection and restoration capability. For this solution, I used PX-DR to synchronously replicate data between the two sites. Synchronous replication with PX-DR requires a shared etcd cluster for quorum, something I will cover later.
- STORK provides us with two key features in this architecture:
- The async replication of Kubernetes objects (like deployments, pods, persistent volume claims, and service specifications) between independent clusters
- Topology-aware scheduling of pods onto the worker nodes closest to the data, via a Kubernetes scheduler extension. When the primary site is active, STORK schedules pods onto the worker nodes in the primary site; when the secondary site is active, STORK schedules pods onto the worker nodes in the secondary site only.
The below diagram illustrates how STORK and PX-DR work together to provide a synchronous disaster recovery solution for Kubernetes.
Working with the Network
My goal was to achieve a common data plane, so I would need a common network between my AWS and Azure tenancies. I also wanted a near-seamless failover, so I would need as near to synchronous replication as I could get. Portworx can perform synchronous replication between sites where the round-trip time (RTT) is less than 10ms. Choosing one AWS data center and one Azure data center located within the same metropolitan area makes this requirement easy to meet. After creating a tenancy in both Azure and AWS, the next step is to connect the two clouds using a VPN.
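As a quick sanity check, you can measure the average RTT between a node in each cloud with ping and gate on the 10ms limit. The helper below is a minimal sketch; the 3.8ms example value is made up, and in practice you would feed in the average reported by `ping` across the VPN.

```shell
# Minimal sketch: check a measured average RTT (in ms) against the
# 10 ms limit Portworx requires for synchronous replication.
rtt_within_sync_limit() {
  # exit 0 (success) when the supplied RTT is under 10 ms
  awk -v rtt="$1" 'BEGIN { exit !(rtt < 10) }'
}

# Example with a made-up RTT; in practice, take the average from
# `ping -c 10 <remote-node-ip>` across the VPN tunnel.
if rtt_within_sync_limit 3.8; then
  echo "sync replication feasible"
else
  echo "RTT too high for sync replication"
fi
```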
Microsoft provides some good documentation on setting up a VPN connection between Azure and AWS here.
Nugget #1: Microsoft Azure expects the Local and Remote IPv4 Network CIDR to be configured as 0.0.0.0/0 on the AWS side
Once the VPN is established, it’s time to move on to setting up the OpenShift Cluster environments.
Installing Red Hat OpenShift
Red Hat OpenShift provides a very easy customer-driven installation process to deploy OpenShift into both Azure and AWS – the installer can be configured in one of two modes:
- Installer-Provisioned – the installer will provision a new AWS VPC (or Azure VNet) and deploy all the subnets, load balancers, route tables, ACLs, and DNS records required to make OpenShift operate
- User-Provisioned – the customer provisions the AWS VPC (or Azure VNet) along with the subnets, DNS records, route tables, and access controls
I opted to take the user-provisioned route for slightly more control, but the installation was quick and painless. You can find more information on the OpenShift installation process through the following links:
Installing Red Hat OpenShift CLI
Whilst most of the configuration tasks can be undertaken from the OpenShift Web Console, a couple of the Portworx cluster monitoring commands are only available via the CLI. You can find the OpenShift CLI installation instructions here.
Configuring OpenShift for Portworx Synchronous Replication
One of the really nice things about Portworx is its topology-aware replication, a feature that is well worth checking out.
Kubernetes nodes with Portworx installed can be made aware of what rack, zone, or region they are deployed within, and Portworx will use this information to influence the placement of the volume replicas. Here is the default replication behavior for each topology:
- Rack: When nodes are in different racks, the administrator can manually specify which rack they want volume replicas to reside within.
- Zone: When nodes are in different zones, Portworx will automatically try to keep the replicas of a volume in different zones.
- Region: When nodes are in different regions, Portworx will automatically try to keep the replicas of a volume in the same region.
You can find out more about topology in Portworx here.
When Portworx is installed on a Kubernetes node, it will pre-populate the Portworx region and zone from the following sources:
- For cloud environments: the failure-domain.beta.kubernetes.io/region and failure-domain.beta.kubernetes.io/zone labels on the Kubernetes node
- For on-prem environments: the px/region or px/zone labels on the Kubernetes node
- If neither is present: the PX_RACK, PX_ZONE, and PX_REGION values configured in the /etc/pwx/px_env file on each node
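For an on-prem node, pinning the topology by hand looks something like the following (the values here are purely illustrative):

```
# /etc/pwx/px_env — illustrative values only
PX_REGION=ocsydney
PX_ZONE=zone-a
PX_RACK=rack-1
```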
You can find out more about setting topology labels here. For this solution, I wanted a copy/replica of the volume data on the local nodes and the remote nodes. Based on the rules above, all nodes for my solution had to be in the same region but in different zones.
By default, when OpenShift is deployed on AWS or Azure, it will set the failure-domain.beta.kubernetes.io/region and failure-domain.beta.kubernetes.io/zone labels to the AWS/Azure region and zone codes (e.g., ap-southeast-2 and australiaeast). Portworx inherits these settings during installation, leaving the Azure and AWS worker nodes in different Portworx regions, so replicas will not be distributed between the two sites. Adding a px/region label to each node, configuring the whole Portworx cluster as one region, overcomes this.
We can do this in the OpenShift GUI:
- Browse to Compute -> Nodes -> Select each worker node
- Click YAML -> Expand Metadata -> add px/region label (px/region: ocsydney in my case)
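If you prefer the CLI, the same label can be applied with oc (the node name below is a placeholder):

```
$ oc label node <worker-node-name> px/region=ocsydney
```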
Routing, Access Control and DNS
As the Kubernetes clusters are independent and isolated from each other, the only communication between the two clusters occurs at the Portworx data platform layer. The routing and access control requirements are driven by Portworx’s requirements, which are outlined here.
- Portworx: IP connectivity is required between all the master nodes and worker nodes in both sites via the VPN tunnel. Portworx TCP/UDP connectivity requirements include the following:
- TCP ports 9001–9022 in both directions (or 17001–17020 on OpenShift)
- UDP port 9002 in both directions
- etcd Quorum: The etcd servers providing a cluster quorum also require routing and network connectivity with the cluster:
- TCP port 2379 between Kubernetes master and worker nodes and all the etcd servers
- TCP port 2380 between all the etcd servers bidirectionally to allow the cluster nodes to synchronize
- STORK: STORK requires DNS resolution of, and TCP connectivity to, the Kubernetes API service on the remote Kubernetes master node in order to migrate Kubernetes objects. Enabling one Kubernetes cluster to resolve the other cluster’s API DNS name presents a challenge; it will require either of the following:
- The creation of the zone and DNS names, in the local DNS, that reflect the remote DNS zone and FQDN for the API service in the remote cluster
- A DNS forwarder in the local DNS server to resolve the remote DNS zone via the remote DNS server
For my solution, I opted to create a zone in each cluster’s local DNS server that reflected the DNS zone of the alternate site. STORK in the local cluster communicates with the remote cluster’s Kubernetes API service to facilitate the migration of the Kubernetes objects. To allow this to occur, connectivity to the remote cluster’s master node on TCP port 6443 is required.
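Before installing Portworx, it is worth a quick reachability check from a node in one site toward a node in the other. The sketch below uses bash’s built-in /dev/tcp rather than any particular tool; the remote IP and port list in the commented example are placeholders to adapt to your environment.

```shell
# Sketch: probe a TCP port across the VPN using bash's /dev/tcp,
# with a 2-second timeout so a blocked port fails fast.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null \
    && echo open || echo closed
}

# Example (placeholder remote IP): Portworx, etcd, and API ports
# for port in 17001 9002 2379 2380 6443; do
#   echo "$port: $(check_port 10.2.0.10 $port)"
# done
```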
Installing the etcd Cluster
To operate PX-DR in synchronous mode, the Portworx key-value database (KVDB) must be deployed on a shared etcd cluster. Portworx recommends that customers deploy a three-node etcd cluster, with one etcd member in each site and the third member deployed as a witness node in an independent third site to act as a quorum tiebreaker, as detailed below.
To meet this requirement, I deployed a single VM in each site and followed the very easy installation instructions that Portworx provides at the following link
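As an illustration, the etcd configuration on each VM ends up looking something like this (member names and private IPs are hypothetical; one member per site plus the witness, on the standard 2379/2380 ports):

```
# /etc/etcd.conf on the AWS-site member — illustrative values only
ETCD_NAME=etcd-aws
ETCD_INITIAL_CLUSTER="etcd-aws=http://10.1.0.10:2380,etcd-azure=http://10.2.0.10:2380,etcd-witness=http://10.3.0.10:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.1.0.10:2380
ETCD_LISTEN_PEER_URLS=http://10.1.0.10:2380
ETCD_LISTEN_CLIENT_URLS=http://10.1.0.10:2379,http://127.0.0.1:2379
ETCD_ADVERTISE_CLIENT_URLS=http://10.1.0.10:2379
```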
Nugget #2: If you are having trouble getting your etcd cluster to converge after start-up/reboot, try setting TimeoutStartSec to 600s inside the file /etc/systemd/system/etcd3.service and rebooting all the etcd servers at the same time.
```
[Service]
Type=notify
Restart=always
RestartSec=25s
LimitNOFILE=40000
TimeoutStartSec=600s
EnvironmentFile=/etc/etcd.conf
```
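After editing the unit file, remember to reload systemd before rebooting so the change is picked up:

```
$ sudo systemctl daemon-reload
$ sudo reboot
```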
Now that you’ve got all the plumbing configured, it’s time to get Portworx installed. Installing Portworx on OpenShift involves just a few easy steps.
Step 1. Install Portworx Enterprise Operator from Red Hat OpenShift GUI
The Portworx Enterprise operator is deployed through the OpenShift GUI—just browse to the OpenShift OperatorHub, search for Portworx Enterprise, and click Install. Complete this step on both OpenShift clusters.
Step 2. Configure Portworx installation specification
Portworx provides a web-based configurator to customize your Portworx installation.
The images below show the options I used to configure my implementation.
Once we finish the configuration steps above, we are presented with a customized kubectl command to install Portworx. Copy the URL from this command and proceed to the step below.
Step 3. Create Portworx installation specification in OpenShift
To finish the Portworx installation, copy just the URL to the customized specification provided by the configurator tool above, and paste it into your local browser. Select all the text and copy it to your clipboard.
Then complete the following step on both OpenShift clusters independently. Browse to the Red Hat OpenShift GUI:
- Click Installed Operators -> Portworx Enterprise
- Click Storage Cluster -> Create StorageCluster
- Click YAML View
- Delete the existing YAML and paste in your customized specification YAML
- VERY IMPORTANT: Add a line to the specification annotations specifying the cluster domain for the OpenShift cluster you are working on. For example:
- On the OpenShift/AWS cluster: portworx.io/misc-args: “-cluster_domain ocaws”
- On the OpenShift/Azure cluster: portworx.io/misc-args: “-cluster_domain ocazure”
- Click Create
- Browse to Pods, select the kube-system project, and monitor the status of the Portworx pods as they install and come online. Portworx will be fully installed and ready to go when all pods show a STATUS of Running.
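For reference, the top of the pasted StorageCluster specification ends up looking something like this (the generated spec contains many more fields, and the cluster name and domain value here are placeholders):

```
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
  name: px-cluster
  namespace: kube-system
  annotations:
    portworx.io/misc-args: "-cluster_domain <cluster-domain-name>"
```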
Step 4. Once you have completed the above tasks on both clusters, you can check the status of the cluster by browsing to Installed Operators -> Portworx Enterprise
- Click Storage Node, and you should see all of your cluster nodes installed, each with a Status of Online.
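The same check is available from the CLI mentioned earlier; for example (the Portworx pod name is a placeholder):

```
$ oc -n kube-system get storagenodes
$ oc -n kube-system exec <portworx-pod-name> -- /opt/pwx/bin/pxctl status
```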
That’s it for the first part. In the next blog, we will deploy WordPress and perform a failover and failback of that application between our AWS and Azure Red Hat OpenShift clusters.