How to configure a multi-cloud application portability solution with Portworx – Part II

In part one of this series, we spoke about how you can deploy Portworx across a multi-cloud environment (AWS and Azure) using Red Hat OpenShift. In this post, we will talk about deploying the WordPress application and then performing a failover and failback operation for this application using Portworx PX-DR.

Deploying WordPress on Portworx

Once Portworx is installed, it’s time to deploy WordPress on the cluster. In this solution, we will configure Portworx to replicate at namespace granularity. The first step is to create a new namespace for WordPress—I have called mine aws-azure-migrationnamespace.
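On OpenShift that’s a one-liner (the namespace name is just what I chose for my lab):

oc create namespace aws-azure-migrationnamespace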

Once the namespace has been created, it’s time to install WordPress into the namespace using the instructions provided here.

Nugget #1: To make absolutely sure the replicas were being distributed across the two sites, I added the zones configuration parameter to my StorageClass, listing the names of my AWS and Azure zones.
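As a rough sketch, the StorageClass ends up looking something like this (the StorageClass name and the repl/priority values are illustrative, and the zone names are placeholders for your own AWS and Azure zone names):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wordpress-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  priority_io: "high"
  # Pin one replica into each site by naming one zone per cloud
  zones: "{aws zone name},{azure zone name}"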

Once WordPress is installed, it’s time to check the replication:

  • Gather the PersistentVolumeClaim (PVC) names in the aws-azure-migrationnamespace namespace with the following command:
oc get pvc -n aws-azure-migrationnamespace -o custom-columns=:.spec.volumeName
  • Gather the names of the px-cluster pods in the kube-system namespace using the following command:
oc get pods -n kube-system | grep px-cluster
  • Locate the Portworx volume ID that backs each PVC by running /opt/pwx/bin/pxctl volume list inside one of the px-cluster pods and grepping for each PVC name:
oc exec pods/px-cluster-83716c65-bbb2-45ee-bae9-dc0cb39bdf07-668p6 -n kube-system -- /opt/pwx/bin/pxctl volume list | grep pvc-561a2c0d-3ece-4f13-b429-c3225caa7e06
  • Finally, inspect the configuration of that volume ID using the following command:
oc exec pods/px-cluster-83716c65-bbb2-45ee-bae9-dc0cb39bdf07-668p6 -n kube-system -- /opt/pwx/bin/pxctl volume inspect 1116088806860818343

NOTE: Notice how the replicas are distributed across the two sites? This is topology-aware replication in action!
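If you have more than a couple of PVCs, a small loop saves some copy-and-paste. Here is a rough sketch of the lookup above (the pod name is from my lab, and note the -- separator so oc hands the arguments through to pxctl):

#!/bin/bash
PX_POD=px-cluster-83716c65-bbb2-45ee-bae9-dc0cb39bdf07-668p6
# For each PV backing a PVC in the namespace, find its Portworx volume ID and inspect it
for PV in $(oc get pvc -n aws-azure-migrationnamespace -o custom-columns=:.spec.volumeName --no-headers); do
  VOL_ID=$(oc exec pods/$PX_POD -n kube-system -- /opt/pwx/bin/pxctl volume list | grep "$PV" | awk '{print $1}')
  oc exec pods/$PX_POD -n kube-system -- /opt/pwx/bin/pxctl volume inspect "$VOL_ID"
done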

Nugget #2: You don’t need to create a Portworx StorageClass for each application – you can just create a StorageClass for each class of application – for example (a sketch of one such class follows this list):

  • Tier 1 Non-Shared StorageClass – replication factor 3, high priority IO, read-write-once volumes
  • Tier 1 Shared StorageClass – replication factor 3, high priority IO, read-write-many volumes
  • Tier 2 Non-Shared StorageClass – replication factor 3, medium priority IO, read-write-once volumes
  • Tier 2 Shared StorageClass – replication factor 3, medium priority IO, read-write-many volumes
  • Tier 3 Non-Shared StorageClass – replication factor 2, low priority IO, read-write-once volumes
  • Tier 3 Shared StorageClass – replication factor 2, low priority IO, read-write-many volumes
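As a sketch, a Tier 1 Shared class along those lines might look like this (parameter names follow the Portworx StorageClass reference; shared: "true" is what gives you read-write-many volumes):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tier1-shared
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"            # replication factor 3
  priority_io: "high"  # high priority IO
  shared: "true"       # read-write-many volumes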

Exposing WordPress to the external proxy / load-balancer

In the WordPress deployment above, the application was exposed to the network through a NodePort Service on node TCP port 30303. Since I am using an external load balancer/proxy in this solution, each worker node’s TCP port 30303 needs to be reachable by that external proxy for heartbeats and request processing. To achieve this, I configured a public IP address for each OpenShift worker node, then added access control list (ACL) entries to allow traffic from the proxy server’s/service’s IP range to each worker node’s public IP address on TCP 30303.
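For reference, the NodePort exposure looks roughly like this (a sketch; the service name and selector labels assume the standard Kubernetes WordPress example referenced earlier):

apiVersion: v1
kind: Service
metadata:
  name: wordpress
  namespace: aws-azure-migrationnamespace
spec:
  type: NodePort
  selector:
    app: wordpress
    tier: frontend
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30303  # the fixed node port the external proxy targets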

Configuring STORK to replicate Kubernetes objects

Once our application’s persistent data is replicating between our two sites, it’s time to move on to the final piece of the solution: replication of the Kubernetes objects using STORK. The Portworx documentation provides very clear and concise instructions for integrating STORK with PX-DR in a synchronous environment. You can find those instructions here.

The basic steps are:

  • Install the storkctl command-line tool
  • Pair the clusters – you can see a copy of my ClusterPair YAML here:
apiVersion: stork.libopenstorage.org/v1alpha1
kind: ClusterPair
metadata:
  creationTimestamp: null
  name: awsazurecluster
  namespace: aws-azure-migrationnamespace
spec:
  config:
    clusters:
      ocazure:
        LocationOfOrigin: /root/.kube/config
        certificate-authority-data: {my CA certificate}
    contexts:
      admin:
        LocationOfOrigin: /root/.kube/config
        cluster: ocazure
        user: admin
    current-context: admin
    preferences: {}
    users:
      admin:
        LocationOfOrigin: /root/.kube/config
        client-certificate-data: {my client certificate}
  options:
    ip: "{ip of DR site Portworx node}"
    port: "17001"
    token: "{my token}"
    mode: DisasterRecovery
status:
  remoteStorageId: ""
  schedulerStatus: ""
  storageStatus: ""
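For what it’s worth, I didn’t write that file from scratch: storkctl can generate the ClusterPair skeleton for you, which you then edit and apply, roughly like this (the namespace and pair name are from my lab):

# Generate the ClusterPair spec on the destination cluster
storkctl generate clusterpair -n aws-azure-migrationnamespace awsazurecluster > clusterpair.yaml
# Add/adjust the options section (ip, port, token, mode), then apply on the source cluster
oc apply -f clusterpair.yaml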

Note: Portworx on OpenShift uses TCP ports 17001–17020 (as opposed to 9001–9020 on other platforms); therefore, we need to tell Portworx which TCP port (17001) to use to contact Portworx/OpenShift in the remote cluster. We do that by editing the options section in the ClusterPair YAML file (seen above).

  • Configure the migration schedule – you can see the migration schedule I used here:
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: aws-azure-migrationnamespace-schedule
  namespace: aws-azure-migrationnamespace
spec:
  template:
    spec:
      clusterPair: awsazurecluster
      includeResources: true
      includeVolumes: false
      purgeDeletedResources: true
      startApplications: false
      namespaces:
      - aws-azure-migrationnamespace
  schedulePolicyName: aws-azure-migrationnamespace-schedule
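Note that the schedule references a SchedulePolicy by name; that object isn’t shown above. A minimal sketch looks like this (the one-minute interval is illustrative; tune it to the RPO you want for your Kubernetes objects):

apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: aws-azure-migrationnamespace-schedule
policy:
  interval:
    intervalMinutes: 1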
  • Monitor the replication using the OpenShift GUI or via the OpenShift command line:

oc describe migrations -n aws-azure-migrationnamespace

Nugget #3: In order for STORK in your production cluster to write the Kubernetes objects into your disaster recovery cluster, it must be able to resolve the DNS name for the Kubernetes API service in the remote zone.  See the Routing, DNS, and ACL section in Part I for some ways to achieve that.

  • Write your failover and failback scripts. To help you on your way, I have included my failover and failback scripts, which you can find here:
Failover workloads – AWS to Azure
#!/bin/bash
# Failover script executes from the Master Server on AWS OpenShift Cluster
LOGFILE="/tmp/portworx-failover-$(date +"%Y_%m_%d_%I_%M_%p").log"
echo "Logging output to $LOGFILE"
# Deactivate the source (AWS) cluster domain – this step runs on the destination cluster via ssh
echo "Deactivating AWS cluster"
ssh core@ocazure-k8s-master "sudo /usr/local/bin/storkctl deactivate clusterdomain ocaws" >>$LOGFILE 2>&1
ssh core@ocazure-k8s-master "sudo /usr/local/bin/storkctl get clusterdomainsstatus" >>$LOGFILE 2>&1

# Scale down the replicas
echo "Scaling pods down to 0 in AWS cluster"
kubectl scale --replicas 0 deployment/wordpress -n aws-azure-migrationnamespace >>$LOGFILE 2>&1
kubectl scale --replicas 0 deployment/wordpress-mysql-ha -n aws-azure-migrationnamespace >>$LOGFILE 2>&1

# stop the migration of the Kubernetes objects so they don't get overwritten on the target
echo "Stopping the STORK migration of objects"
kubectl apply -f failover-application-suspend-schedule.yaml -n aws-azure-migrationnamespace >>$LOGFILE 2>&1
storkctl get migrationschedule -n aws-azure-migrationnamespace >>$LOGFILE 2>&1

# start the application on the destination cluster
echo "Scaling up the pods on Azure cluster"
ssh core@ocazure-k8s-master "sudo kubectl scale --replicas 1 deployment/wordpress-mysql-ha -n aws-azure-migrationnamespace" >>$LOGFILE 2>&1
ssh core@ocazure-k8s-master "sudo kubectl scale --replicas 3 deployment/wordpress -n aws-azure-migrationnamespace" >>$LOGFILE 2>&1

failover-application-suspend-schedule.yaml:
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: aws-azure-migrationnamespace-schedule
  namespace: aws-azure-migrationnamespace
spec:
  template:
    spec:
      clusterPair: awsazurecluster
      includeResources: true
      startApplications: false
      includeVolumes: false
      purgeDeletedResources: true
      namespaces:
      - aws-azure-migrationnamespace
  schedulePolicyName: aws-azure-migrationnamespace-schedule
  suspend: true

Failback workloads – Azure to AWS
#!/bin/bash
# Failback script executes from the Master Server on AWS OpenShift Cluster
LOGFILE="/tmp/portworx-failback-$(date +"%Y_%m_%d_%I_%M_%p").log"
echo "Logging output to $LOGFILE"

# activate the AWS cluster domain
echo "Activating ocaws clusterdomain"
ssh core@ocazure-k8s-master "sudo /usr/local/bin/storkctl activate clusterdomain ocaws" >>$LOGFILE 2>&1

# get the cluster status
echo "Getting ClusterDomain status"
ssh core@ocazure-k8s-master "sudo /usr/local/bin/storkctl get clusterdomainsstatus" >>$LOGFILE 2>&1

# stop the applications on the cluster
echo "Scaling Azure pods to 0"
ssh core@ocazure-k8s-master "sudo kubectl scale --replicas 0 deployment/wordpress -n aws-azure-migrationnamespace" >>$LOGFILE 2>&1
ssh core@ocazure-k8s-master "sudo kubectl scale --replicas 0 deployment/wordpress-mysql-ha -n aws-azure-migrationnamespace" >>$LOGFILE 2>&1

# scale up the pods on the source cluster
echo "Scaling up pods in AWS Cluster"
kubectl scale --replicas 1 deployment/wordpress-mysql-ha -n aws-azure-migrationnamespace >>$LOGFILE 2>&1
kubectl scale --replicas 3 deployment/wordpress -n aws-azure-migrationnamespace >>$LOGFILE 2>&1

# re-enable the schedule
echo "Re-enabling STORK replication of objects"
kubectl apply -f failback-application-resume-schedule.yaml -n aws-azure-migrationnamespace >>$LOGFILE 2>&1

failback-application-resume-schedule.yaml:
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: aws-azure-migrationnamespace-schedule
  namespace: aws-azure-migrationnamespace
spec:
  template:
    spec:
      clusterPair: awsazurecluster
      includeResources: true
      startApplications: false
      includeVolumes: false
      purgeDeletedResources: true
      namespaces:
      - aws-azure-migrationnamespace
  schedulePolicyName: aws-azure-migrationnamespace-schedule
  suspend: false

Failing over the application on the front end

With the platform parts of the solution finished, it’s time to think about how we will provide a seamless experience for our front-end clients. When architecting for scaling (and failure) in a horizontally scaling, cloud-native application that lives in a single cloud, you simply provision a cloud load balancer. But to achieve that seamless experience when your application spans two clouds, you need a load balancer that sits outside both clouds. For my solution, I successfully tested two different methods:

  • An NGINX reverse proxy hosted at a third site, caching and proxying my application and monitoring failover (and failures) with heartbeats. You can find the skeleton/un-optimized NGINX config I used here:
worker_processes  1;
events {
    worker_connections  1024;
}
http {
    proxy_cache_path /nginx-cache/ levels=1:2 keys_zone=STATIC:100m
                     inactive=1h max_size=1g;
    upstream puremite.karaka.nz {
         server (aws server public IP #1):30304 fail_timeout=3s;
         server (aws server public IP #2):30304 fail_timeout=3s;
         server (aws server public IP #3):30304 fail_timeout=3s;
         server (azure server public IP #1):30304 fail_timeout=3s;
         server (azure server public IP #2):30304 fail_timeout=3s;
         server (azure server public IP #3):30304 fail_timeout=3s;
    }
    server {
      listen       80;
      server_name puremite.karaka.nz;
      location / {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_pass http://puremite.karaka.nz;
        proxy_cache STATIC;
        proxy_cache_valid any 5m;
        proxy_cache_use_stale error timeout invalid_header updating
                              http_500 http_502 http_503 http_504;
      }
    }
}
  • A third-party service, caching and reverse proxying my application and monitoring failover (and failures) using heartbeats.

Generally, I have found Portworx to be very fast to fail over and fail back, but the addition of caching in my front-end proxy smoothed out that failover for static pages.

Nugget #4: To optimize the load balancer heartbeat process, I created a small ‘Hello World’ PHP script in the WordPress document root and configured it as the target URL for the heartbeat process for both methods. In production, your heartbeat target URL might be less static and more application-aware.

Demo

I have spent a lot of space describing my project here, but you probably want to see what it looks like in real life. Have a look at this video to see the entire solution in action.

Wrapping it all up

Honestly, in the early days, I was skeptical about where Kubernetes was going to end up, but over time, it has converted me into a fan with its application-focused approach to availability, scaling, life-cycle management, and cross-platform portability. Kubernetes offers enterprises a very powerful platform to create cloud-native applications faster, for any cloud, and at any scale.

At the beginning of this project, I set out to answer four questions:

  • Kubernetes workloads are supposed to be portable, but could I truly make an application portable across two platforms—like two public clouds or one private cloud and one public cloud? 
  • If Kubernetes is now becoming a very feasible candidate for stateful applications, how do I provide disaster recovery for those applications?
  • What tools would I use to provide this application portability and disaster recovery?  
  • How would I orchestrate the failover since these are entirely separate environments?

The answers to all the questions were pretty clear:

  • With Kubernetes and Portworx, applications are portable between different sites, clouds, and platforms.
  • With Kubernetes and Portworx, we can design an excellent disaster recovery solution for our mission-critical workloads.
  • Red Hat OpenShift and Portworx are widely used and proven tools for this use case.
  • We deal with failover the same way we deal with auto-scaling in a cloud-native application—via a load balancer.

Red Hat OpenShift Assessment

With Red Hat OpenShift, from the minute I first looked at the installation guide, it was immediately clear that Red Hat is putting the enterprise into Kubernetes. It’s easy to install, easy to integrate and manage, and it comes with the solid release management, security, and enterprise support that customers have come to expect from Red Hat. While my tour of OpenShift on this project was short, it won’t be my last outing, and I am excited to learn more about OpenShift this year.

Portworx Assessment

As for Portworx, it’s rare that a product captures my attention the way that Portworx has. My appreciation for its design and the value it offers customers continues to increase the more I look under the hood. In my opinion, Portworx is a must-have for customers who want to run container environments at scale. If you haven’t had a chance to give Portworx a try, you should. You can download a free trial or schedule a call with Portworx to learn more.

Answering an Important Question: Should You Do It?

If you are going to take anything away from this project of mine, it should be that Kubernetes, Portworx, and Red Hat OpenShift are all massively powerful technologies that enable data mobility, regardless of where it resides. With data mobility comes choice and flexibility, both of which are very important considerations for any IT strategy in a fast-moving technology landscape.

When you are considering production and disaster recovery between two different public clouds, there will be a number of other things—like network egress fees and public cloud edge outages—to consider and manage, and these may preclude inter-cloud disaster recovery from being the right solution for you. Yet an architecture that pairs one on-prem datacenter with one public cloud datacenter is also worth some consideration. It may provide the best of both worlds—the performance and cost characteristics of keeping data local, the elastic nature of the public cloud with minimal network egress fees, and the flexibility to migrate your data somewhere else, should things not work out.

Andy Hughes

Systems Engineer, Enterprise | Pure Storage
