Kubernetes Tutorial: How to Failover PostgreSQL on Google Kubernetes Engine (GKE)

This tutorial is a walk-through of how to failover PostgreSQL on Google Kubernetes Engine (GKE) by Certified Kubernetes Administrator (CKA) and Application Developer (CKAD) Janakiram MSV.

TRANSCRIPT:

Janakiram MSV: Hi. In this demo, I want to show you how to perform a failover of a PostgreSQL database running within Kubernetes backed by Portworx. We already have a running Postgres deployment with just one replica, that is, one instance of the database. Even though we are running just one instance, we can achieve failover and high availability thanks to Portworx. So let’s see how to achieve HA and failover of the single pod running the PostgreSQL database.

The first thing that we are going to do is to populate our database with some sample data. For that, I’m going to grab the name of the pod by querying on its label, app=postgres. This will get us the only pod that is running PostgreSQL, and then we are going to do an exec to get into the shell of this pod. Now we are inside the Postgres pod and we can go ahead and run the psql CLI.
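The exact commands typed in this step are not captured in the transcript. A minimal sketch of what they might look like, assuming the pod carries the label app=postgres and runs in the current namespace (the function names are illustrative only):

```shell
# Print the name of the first pod matching the label app=postgres.
# Assumption: exactly one such pod exists, as in the demo.
postgres_pod() {
  kubectl get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}'
}

# Open an interactive shell inside that pod; psql is then run from there.
postgres_shell() {
  kubectl exec -it "$(postgres_pod)" -- bash
}
```

Calling postgres_shell drops you into the container, where typing psql opens the database prompt.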

So now we are within the psql shell and we can run a bunch of commands to populate a sample database. Let’s go ahead and create a database called PX Demo. Now we can verify this with the \l command, and we notice that the PX Demo database is now created. Let’s quit this and then populate it with some sample data. We take the help of the pgbench CLI to do that. This is going to create about 5 million rows inside the demo database that we just created. That’s a lot of data. So let’s give it a few seconds and we should have a fully populated sample database.
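A sketch of this step as non-interactive commands, to be run inside the Postgres pod. Two assumptions are made here: the database name is spelled pxdemo, and psql and pgbench can connect as the container's default user (for example because PGUSER is set in the pod). The scale factor of 50 is inferred from the row count: pgbench creates 100,000 rows in pgbench_accounts per scale unit, so 50 gives 5,000,000.

```shell
# Create the demo database and list all databases to confirm it exists.
# Assumption: the database is named pxdemo.
create_demo_db() {
  psql -c 'CREATE DATABASE pxdemo;'
  psql -l    # non-interactive equivalent of \l at the psql prompt
}

# Initialize the pgbench tables with sample data.
# -i  initialize mode (creates and fills the pgbench_* tables)
# -s  scale factor; 50 * 100,000 = 5,000,000 rows in pgbench_accounts
load_sample_data() {
  pgbench -i -s 50 pxdemo
}
```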

Alright, let’s verify this by getting into the psql shell again and querying for the tables. You notice that pgbench has created a set of tables, and we can now get the count of the number of rows from the accounts table, pgbench_accounts. And that shows about 5 million rows. Perfect. Now, let’s quit the psql shell and also exit the pod. What we are now going to do is something pretty crazy. We’re going to cordon off the node that’s running the Postgres database, which means Kubernetes will avoid scheduling any new pods on it.
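The verification just described could be sketched as follows, again run inside the pod. The database name pxdemo and the ability to connect as the default user are assumptions; pgbench_accounts is the standard pgbench table name.

```shell
# List the tables and count the rows that pgbench created.
check_sample_data() {
  psql -d pxdemo -c '\dt'                                     # list tables
  psql -d pxdemo -c 'SELECT count(*) FROM pgbench_accounts;'  # row count
}
```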

To do that, we need to get the name of the node that is currently running our Postgres database. When we do a get pods with this command, we notice that the pod is running on one of these nodes, so we’re going to grab the node name in order to cordon off that node and avoid further scheduling. What I’m trying to achieve here is to emulate a node failure by cordoning it off. Once we cordon off the node and then go ahead and delete the pod, we will not disrupt our workload, because Portworx is going to make sure that the newly created pod is deployed on one of the nodes that already has the data.

So let’s emulate a node failure by cordoning the node. Before that, let’s take a look at the nodes. You notice that all three nodes are in the Ready state. All of them are capable of scheduling pods and running them. We will now go ahead and cordon off the node that is currently running the PostgreSQL pod. Now if we do a “kubectl get nodes”, you notice that the second node, which is the one running our current pod, is cordoned off. That means no more pods are going to be scheduled on it.
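A minimal sketch of looking up the node and cordoning it, assuming the pod is labeled app=postgres (the function names are illustrative only):

```shell
# Print the name of the node that hosts the Postgres pod.
postgres_node() {
  kubectl get pods -l app=postgres -o jsonpath='{.items[0].spec.nodeName}'
}

# Show the nodes, cordon the one running Postgres, then show them again.
cordon_postgres_node() {
  kubectl get nodes                   # before: every node reports Ready
  kubectl cordon "$(postgres_node)"   # mark the node unschedulable
  kubectl get nodes                   # after: that node shows Ready,SchedulingDisabled
}
```

Cordoning only blocks new scheduling. Pods already running on the node keep running, which is why the demo deletes the pod as a separate step.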

Alright. Now, let’s also go ahead and delete the pod. For that, I’m going to get the pod name of the Postgres database. So this is the pod name, the only pod that we have in the default namespace. Now what we are going to do is actually delete this pod. So “kubectl delete pod”. Now, it is going to delete the pod. As soon as that is done, because the pod is part of a deployment and we set the replica count for this deployment to one, Kubernetes will automatically create a new pod. And now a new pod is created, but interestingly this pod is not located on the node where we originally ran the pod. That node is cordoned off, which means the pod cannot be scheduled on it anymore. Instead, it has been scheduled on the third node.
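The deletion step might look like the following sketch, assuming the same app=postgres label:

```shell
# Delete the current Postgres pod, then list pods with their node placement.
delete_postgres_pod() {
  pod=$(kubectl get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
  kubectl delete pod "$pod"
  # -o wide adds a NODE column showing where the replacement pod was scheduled
  kubectl get pods -l app=postgres -o wide
}
```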

So, with this complete exercise of simulating a node failure followed by deleting the pod, in normal cases we might actually face the risk of losing data, but not when we are running Portworx. Let’s check that. Before that, let me make sure I uncordon the node. And now, let’s see if our data is still intact. We’ll grab the name of the new pod that is scheduled on the third node, get into the shell of this pod, run psql with PX Demo, and then check for all the tables. There we go. Everything is intact. And finally, the acid test, where we check the row count of pgbench_accounts, and it is still five million. Fantastic.
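A sketch of the recovery check, under the same assumptions as before: the label app=postgres, the database name pxdemo, and psql able to connect as the container's default user. The node name is passed in because the new pod no longer runs on the cordoned node.

```shell
# $1: name of the node that was cordoned earlier
verify_after_failover() {
  kubectl uncordon "$1"    # allow scheduling on the node again
  pod=$(kubectl get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
  # Run the checks directly through kubectl exec instead of an interactive shell
  kubectl exec "$pod" -- psql -d pxdemo -c '\dt'
  kubectl exec "$pod" -- psql -d pxdemo -c 'SELECT count(*) FROM pgbench_accounts;'
}
```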

So even after the node failure and pod deletion, the data is intact. That’s because of the replication factor that we set when creating the storage class. Behind the scenes there is also the magic of STORK, the Storage Orchestrator Runtime for Kubernetes, created by Portworx. It’s a custom scheduler which ensures that the newly created pod is placed on a node that already has a replica of the data. So the combination of STORK and the replication factor resulted in the high availability of the pod even while running a single instance of the workload.
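The storage class itself is not shown in the transcript. As a rough sketch only, a Portworx storage class with a replication factor could look like the manifest printed below. The class name px-repl3-sc and the value repl: "3" are assumptions, and kubernetes.io/portworx-volume is the in-tree provisioner name used by older setups; CSI-based installs use a different provisioner.

```shell
# Print a hypothetical StorageClass manifest with a Portworx replication factor.
# Assumptions: the name px-repl3-sc and the value repl: "3" are illustrative.
print_storage_class() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-repl3-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
EOF
}
```

The output can be piped to kubectl apply -f - to create the class. For STORK to place the pod, the deployment's pod spec would typically also set schedulerName: stork.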