December 3, 2017
Using External Persistent Volumes to Reduce Recovery Times and Achieve High Availability on DC/OS
Dinesh Israni, Senior Software Engineer at Portworx, gave a presentation and Q&A on “Using External Persistent Volumes to Reduce Recovery Times and Achieve High Availability on DC/OS” at MesosCon Europe. Besides reading this transcript, you can also watch a video of the presentation.
Welcome, everyone. My name is Dinesh Israni and I am a software engineer at Portworx. Today I’m going to be talking about how you can use external persistent volumes to achieve high availability and reduce recovery times when using stateful services on DC/OS. Let’s jump right in.
Here are topics that I’m going to cover today. We’ll talk about the different kinds of stateful services, then I’ll go through the advantages of using external persistent volumes. I’ll also give you an introduction about Portworx and how you can deploy services on DC/OS to take advantage of Portworx volumes. Then I will do a demo showing how you can install Portworx and Cassandra to use Portworx volumes and basically demonstrate failover and some other useful scenarios.
Let’s talk about stateful services. There are basically two types of stateful services when it comes to persistent data. The first are simple applications that don’t do their own replication. They basically rely on the underneath storage layer to be always available, so in cases of failures the storage layer would basically make the same volume available on another node, so that the application can come up with the same state.
The second type is where applications do their own replication across nodes, so in case a node dies or fails there is always another copy of the data on the cluster. So if the node that had crashed does come online, then the replication takes care of basically repairing data back onto that node that it had missed. Now this replication or reapplication or repair can either be manually triggered or it can be automatic, depending on the application.
Some of the examples for the first type of application are basically WordPress and MySQL, which run in simple mode and don’t do their own replication—whereas for the second type it would be Cassandra or HDFS, which basically do replication across nodes and then they do repairs in case of failures.
Now, you might be asking, why is this replication strategy important? Well it’s because bad things happen all the time. Your nodes could crash. Your network could have issues. For example, your network could get partitions, so none of your nodes are in quorum. Your disk could go down or your nodes could go down. Your entire rack could go down, bringing down multiple nodes.
You might be asking, why is this replication strategy important? Well it’s because bad things happen all the time.
For applications that do their own replication, there is always another copy on one of the other nodes in the cluster, so you can still continue to serve IOs, and your application will not be affected by one node being down. And if you have to replace a node, you can just bootstrap it and repair all the necessary data to it. This does end up taking a lot of time sometimes, depending on how much data you had on the node that did go down. And in that time that you’re repairing data to that node, your throughput for IO for that service would basically drop.
For instance, if you had a Cassandra cluster that was doing a three-way replication and one of your nodes goes down, you would then only be able to serve reads from two of the nodes instead of being able to serve from three of the nodes. And while the repair is going down, your network would also take a hit because all the data would need to be transferred from the two nodes back to the one node that is being brought up.
For non-clustered applications, though, if you had no backup and were using local storage, your application would be doomed because you would not be able to bring it back up until you can either move your data from the data disk from the nodes that went down. And in case your data nodes actually had physical corruption, then you will basically end up losing your data.
How can external persistent storage help in all of this? Well, for the case of applications like MySQL and WordPress that don’t do their own replication, it can help to provide high availability for your services, and it basically makes sure that your down time is eliminated. And for services that do do their own replication, it can help reduce recovery times by a large amount, because what will end up happening is you will not have to bootstrap data back to your new node. All you would have to do is basically repair data for the time that the node was down before it came back on another node.
How can external persistent storage help in all of this? For applications like MySQL and WordPress that don’t do their own replication, it can help to provide high availability for your services, and it basically makes sure that your down time is eliminated.
I’ll talk a little bit more about this in the next slides. Another advantage of using external storage providers like Portworx is that it helps you virtualize your storage so you can basically grow your compute and storage needs independently of each other. Before we talk about the scenarios, I just wanted to give a brief introduction about Portworx, because a lot of the scenarios that we talk about depend on a software storage solution like Portworx already installed on your cluster.
Portworx is the first production-ready software-defined storage designed from the ground up for microservices in mind. Using Portworx, you can provision and manage container-granular virtual devices and tight integration with schedulers and container orchestrators—basically help run your workload local to where the storage is provisioned.
Portworx is the first production-ready software-defined storage designed from the ground up for microservices in mind.
Apart from this, we also have a couple of other useful features that help manage your daily needs for storage. You can basically take a copy-on-write snapshot that can be restored from, in case you have any outage or any issue with the services that are using your volumes. You can also take something called cloud snap, which will basically back up your entire volumes into any object store outside your cluster. So basically, that object store can be any SC-compliant object store as your blob storage or Google cloud storage. We also allow encryption of volumes on either cluster level so you can have one cluster-wide encryption key or you can have encryption keys on a per-volume basis. This is very helpful in cases where you have multitenants and you don’t want to share encryption keys between different volumes and different services.
Another feature of Portworx is that everything is basically API driven, so everything can basically be automated. You don’t have to manually go in and either provision volumes or manage your data. You can basically—this is written for DevOps from the ground up, so everything is basically automatable, and you would never have to actually go in and manually do any kind of repair or recovery. And we actually run as a container ourselves, so it makes it very easy to deploy and maintain. And you must have heard about CSI in the past couple of days. And as soon as the schedule orchestrator has support for CSI, Portworx will also have support.
This is how basically Portworx would look once it’s deployed. Portworx would basically scan all the block devices, whether it is your direct data storage devices or an EVS volume or Azure-managed disks, and cast them out into one big cluster storage pool. And when your pods or your containers are spinning up, they would basically go ahead and request for thinly provisioned volumes from this cluster storage pool. And we would basically replicate data for these volumes across nodes, so that in case one of your nodes goes down you’re still able to bring up the container on another node with the same state.
Here, basically the orange part is what Portworx would comprise, and you would basically run different apps that would request for volumes from Portworx. I talked about block-level replication this is how basically it would work. All rights would basically be synchronously replicated across multiple nodes. And these replicas are actually accessible not just from the nodes where the data lies, but from any node where Portworx is installed—in case you want to scale your cluster where you have a few nodes that have more compute and memory and have separate nodes that have more storage.
You can always start a Portworx as storageless nodes to consume storage from the nodes that have more storage available. And in this case, what will happen is basically all rights would basically be synchronously replicated on to other nodes. And if a node goes down, basically you would still be able to use another replica from another node. And when the node that had gone down basically comes back up, Portworx in the background would repair the data to bring that replica back up to state with the current state of the volume. And if that node goes down permanently, what Portworx would do is basically we would re-replicate that data onto another node so that you always have the minimum replication that you had specified for your volume.
And when the node that had gone down basically comes back up, Portworx in the background would repair the data to bring that replica back up to state with the current state of the volume.
Now let’s talk about the recovery times with local volumes and external persistent volumes. This is mostly talking about the second type of applications, where applications do their own replication. Now if vendors use local storage they are basically pinned to that node. With local volumes, if a node crashes, the faster that the node comes back up online is better because then you have less data to repair. But in reality that is not always the case. Nodes sometimes could… You could take down nodes for maintenance cycles, and in that case your time period that the node is down could be large. So you will end up having to repair a large amount of data.
And nodes do take some time to come back up in case they are crashed. So that just extends the time that you would have to repair data for. But in case that the node fails permanently, this repair time can be even longer because you would basically have to re-replicate all your data from one of the other replicas in the cluster back to that node. And this could end up taking a lot of time. It could range from anywhere between hours to days. And while this is happening, like I had mentioned before, the throughput of your service will be affected. And also, you would end up using a lot of network throughput just to repair data back onto that node, and this can actually bring down your entire service if you basically end up using a lot of throughput just to do the repairs.
Now, let’s take a look at the recovery times with external storage. With the external volumes, basically your data is accessible from any node in the cluster. If a node does go down, you don’t have to wait for it to come back up to basically to bring your service back up. All that would need to happen is your scheduler would need to schedule that service onto another node and ask for it to use the same volume. And it would come back to its same state, and all you would have to do is basically repair data for the time that it took for the scheduler to reschedule your pod onto another node.
And this is actually similar in case a node dies permanently, too, because since there’s always another replica available on the cluster, you would just need to wait for the scheduler to schedule your container onto another node and it would… Again you would just need to repair the data back onto that node for the time that the node was down.
Okay, so here are some of the advantage of using Portworx or any software-defined storage like Portworx for your solution. Basically using a SAN or a NAS is an anti pattern in microservices, because by using a SAN or a NAS you will end up losing the flexibility that you have from microservices, because you’re basically swallowing your storage into a storage that’s outside your current cluster. This also introduces latencies, because for all the data that is to be returned to your storage, it needs to basically traverse from your computer cluster to your storage cluster. And it is also actually a one point of failure, so in case you lose that link between your computer cluster and your storage cluster, all your stateful services will go down at that point.
And actually having an external storage already increases complexities and failures when you’re dealing with node crashes and other scenarios that might happen in your cluster. Like I did mention, Portworx is built for microservices from the ground up. And one of things that we do is that we have tight integration with schedulers, so we can actually influence schedulers to co-locate your tasks with where the data is located, instead of basically having it located anywhere. And so we don’t share data across all the nodes we make it, so that our containers can take advantage of having data local to the node where they’re scheduled.
[Portworx has] tight integration with schedulers, so we can actually influence schedulers to co-locate your tasks with where the data is located, instead of basically having it located anywhere.
Another advantage is that there is a common solution for hybrid deployments. There are a lot of scenarios nowadays where companies have hybrid deployments where they have an on-prem cluster, but they want to basically burst into the cloud. Now, if you were not using a software-defined storage solution like Portworx, you would basically need to have auto machine and tools that were different for both your on-prem and cloud solutions. By having one unified layer of managing your storage, you can eliminate that and build your apps faster, rather than having to worry about managing your storage on different deployments.
Okay, so now you might ask, why not just use EBS directly? Like I pointed out, first of all EBS will work only on AWS, same thing with others like your managed disks or Google cloud storage. You would basically need to have different automation and tools for your hybrid deployments. Also, EBS has a limit of 16 volumes that you can attach to an EC2 instance, so in case you have PFEC2 instances and you want to work a lot of containers on that, you will end up hitting a limit. So you’ll not be able to tightly pack your services onto a node in that case.
Why not just use EBS directly? Like I pointed out, first of all EBS will work only on AWS, same thing with others like your managed disks or Google cloud storage.
One of the most common scenarios that you would run into with EBS is actually when you’re testing failover scenarios, you will see a lot of the time EBS volumes would get stuck in attaching or detaching state. This is problematic if you want to basically automate your entire environment, because somebody will have to manually go in and then detach your volume or attach it to the new node. And in fact, I was just talking to somebody earlier this week and they mentioned that one of the volumes got stuck in detaching state because they were using a volume plugin and there was no way out of it. The only thing that they could basically do was delete the EBS volumes, and that is a data loss situation that you don’t want to get into in production.
Another thing with EBS is the performance is not always up to mark. So you can always pay for provision diodes, but at that point you will end up spending a lot more money. And the last thing is failover is slow, so EBS wasn’t designed for the microservices in mind, so it wasn’t designed for scenarios where you would basically be failing over your containers or services very often. So it ends up taking a lot of time to fail over your EBS volumes from one node to another. There’s nothing wrong with using EBS as such; the only thing is you don’t want to be using EBS volumes directly with your containers. You want to have a layer on top of EBS that does the management of your container-granular volumes so that you can easily failover your containers, instead of having to depend on EBS doing the move for your volumes.
How can you use Portworx with your stateful services? There are three ways that you can do this right now. The first way is you can deploy simple services in Marathon using Portworx as the volume driver. And any applications in the DC/OS Universe that allow you to configure an external volume provider, you can basically specify Portworx as driver. The second way is you can basically deploy services that we’ve developed based on dcos-commons, which are also available in the Universe. I’ll go through these in detail later. And we also have the dcos-commons framework, which we’ve modified to be able to use Portworx volumes. You can always use that to develop your own services, too.
How can you use Portworx with your stateful services? There are three ways that you can do this right now.
This is an example of how you can use Marathon to use Portworx volumes. This is a simple example of MySQL container that you would spin up. All you would have to do is, in the parameters for your container, you would basically need to specify the volume driver as Portworx. And in the volume parameter, you would be able to specify the size for the volume, the replication factor, the name for the volume, as well as any other parameters that you want to use while creating the volume. And you don’t need to pre-provision volumes at all in this case. All of this would be dynamically done. As soon as you launch this, Mesos would basically try to spin up a Docker container; we would get a request to basically mount this volume. We would see that this volume has not been created. These options would then get passed to us, and we would dynamically create these volumes and mount it inside your container. And you can basically use a similar spec to use with Docker as well as UCR.
About the dcos-commons-based services that I was talking about: So we basically modified—enhanced—the dcos-commons framework to work with Portworx volumes. There are basically four services that are available through dcos-commons, and you can obviously write more, but the four services that are available are Cassandra, Hadoop, Elasticsearch, and Kafka. And we’ve actually added support for Portworx volumes and submitted these to the Universe. All you have to do is go to the Universe and search for Portworx and you would be able to install these four services with Portworx volumes backing the state.
We basically modified—enhanced—the dcos-commons framework to work with Portworx volumes.
We’ve actually made a couple more enhancements where we’ve allowed for the task to fail over between nodes, because there is nothing that’s pinning the tasks to a particular node now. Since there is Portworx backing these services, this basically means that your services will have a higher uptime—as well as your recovery times will be reduced by a large margin. We also made changes to the framework so that your volumes are co-located with your tasks, so this reduces latencies as well as network usage.
This is an example of a Hello World program. This is basically available in dcos-commons, and the path that I’ve highlighted is all you would need to change to use Portworx volumes instead of the default route and mount volumes. So dcos-commons is actually a great way to write stateful services. The only thing is right now before that, support for CSI, they only support route and mount disks, which basically means your services up into a particular node. In case that node goes down, that task is not going to spin up on another node because that data is available only on the node that went down.
So dcos-commons is actually a great way to write stateful services.
Like I mentioned, we’ve modified it to support Portworx volumes, and all you have to do is basically change the type of the volume from route mount to Docker, then specify the Docker volume driver as PXD, the Docker volume name which is the word name for the volume. Over there you can also specify the size… Not the size, use size as a different parameter, but you can specify the replication factor you want to use, as well as if you want to encrypt volumes and any other parameters that you would be able to pass to Portworx.
Basically you can specify any service as a spec like this and specify Docker volume drivers and Portworx to take advantage of Portworx. And all of this again… Volumes are all dynamically provisioned so you don’t have to go out of band or talk to your storage provider to provision volumes. All of this will be taken automatically when the service first comes up. The source code for this is available at github.com/portworks/dcos-commons in case you want to take a look.
All right, demo time. So what I’m going to show you is basically I’m going to install Portworx on DC/OS. I’m going to show you how easy it is to install DC/OS, and then I’m going to install Cassandra and go through a couple of scenarios that you could encounter in your production.
All right. Not sure what’s happening. Let me just open it up again, sorry.
Basically, Portworx as well as all the other services as specified are available in the Universe, so all you have to do is go and search for Portworx. We are going to select Portworx service. We have a six-node cluster, which is basically five private nodes and one public agent. And we are going to install Portworx on the five private agents. We are going to specify the management and data interface that we want to use, and just click Review and Deploy. And what this is going to end up doing is, it’s going to spin up an etcd cluster, and then it’s going to spin up. We use etcd to store our control plane state. It’s all going to spin that up. It’s going to spin up InfluxDB, which we use to store statistics. It’s also going to spin up LightHouse, which is our UI. And finally, it’s going to install Portworx on all the five private agents.
Portworx as well as all the other services as specified are available in the Universe, so all you have to do is go and search for Portworx.
I sped up the video a little because it ends up taking around five, six minutes to install. But as you can see, the etcd cluster got installed, and the proxy got installed, and InfluxDB got installed, and LightHouse got installed. Basically what’s happening now is the Portworx is getting installed on all the nodes. And if you go to Completed, you’ll see there are five tasks for the Portworx install that finished. Basically what I’m doing now is, I have searched into one of the private agents, and I’m just going to watch for the status from PX-card, which is our CLI, and wait for all the nodes to come up.
As you can see, three nodes have come up, and I have purposefully not attached a disk to one of the nodes. So you can see that all of these automatically get provisioned, get added to a cluster as either a storage node or a storageless node, depending on if there were block devices attached to that node or not. We are just going to pause it here for a second. The cluster is up now and as you can see, we have five nodes in the cluster. Four of these nodes have 400 GB disks attached to them, whereas one of them is basically a storageless node.
Now that the Portworx cluster is up, what we are going to do is, we are going to go ahead and spin up a three-node Cassandra cluster on top of this. Again, the Cassandra Portworx service is available in the Universe. I’m just going to show you that there are no volumes created initially, so you just need to run PX-card Volume List and you can see there are no volumes. Now we are going to go back into the catalog and search for the Portworx Cassandra service. And we don’t really have to change anything, but by default the options are not specified because it depends on what you want to provision it as. But we’re going to go ahead and specify that we want Portworx cluster with a replication factor of three. This is going to create three 10 GB volumes and attach them to each one of the Cassandra nodes.
All right. We just have to click Review Deploy and then Deploy. And we’re going to look at the state of the service. Again, I have sped it up a little because it takes time for Cassandra just to spin up to pull the artifacts. But as you saw, one of the nodes has spun up and it automatically provisioned the 10 GB volume with the replication factor of three and attached it to the node whose IP ends with 151, which we’ll see from the DC/OS UIs, where the Cassandra node had spun up. The second node comes up, and then the third node comes up. At this point, all three volumes will be provisioned. We’ll just do a Volume List again, and we see all three volumes have been created with a replication factor of three. All we had needed to do was specify the base name for the volume, and then for node zero, it created Cassandra Zero, Cassandra One, and then Cassandra Two.
Now that the cluster is up, one of the scenarios that I was talking about was what happens if your node fails. If you didn’t have something like Portworx running providing storage to your Cassandra cluster, the framework would keep retrying, waiting for the node to come back up. And it would not start it on another node, unless you manually went in and said that you want to replace a node.
If you didn’t have something like Portworx running providing storage to your Cassandra cluster, the framework would keep retrying, waiting for the node to come back up.
Now if you replace the node, you would need to run a Bootstrap and a repair command again, which could be expensive. What’s going to happen here is since we have replicated the data across nodes, as soon as you kill the node… I’m just going to go ahead and kill one of the nodes where the task is running. I’m just going to power it off. What we’re going to see is that the framework is going to realize that there’s nothing for this task that is pinning it to that node. And as you can see, DC/OS already figures out that node is offline, and the framework is also going to realize that the node has gone offline, there’s nothing…
All it requires from a node are CPU, memory, and disk and data that is not really pinned to a node. It’s going to basically go ahead and spin up that same Cassandra node onto another node, which was node ending with 131, which is A3. I’m just going to go back into A3 and we’ll just run PX-card Volume List, which will show us that the volume has now been attached onto this node. Yeah. If you check status, that’s the node that we actually took down. And if you do a Volume List, you’ll see that Cassandra Two, that’s the task that we had killed, is basically attached to the new node right now.
Now, one more thing you must have noticed is that I actually just allocated 10 GB of data to each one of these nodes. Now, if you’re running in a production cluster and you spun up this cluster and gave it to the production guys to use, you would soon realize that this is not enough. Now, if you were not using something like Portworx, what would need to happen is you would need to spin up an entire new cluster and allocate it more space. But with Portworx, all you would really need to do is… I’m just showing the output of DFO here, and you can see that the volume, which is mounted, under value as the amount Cassandra Zero has a size of 9.8 GB, roughly 10 GB. So, what we’re going to do here is we can actually dynamically resize this volume. So that you don’t really need to take your application or service offline. All you need to do is run once a simple command, and it will automatically resize your volume.
With Portworx… you don’t really need to take your application or service offline. All you need to do is run once a simple command, and it will automatically resize your volume.
All these commands that I’m showing, I’m using the CLI just to make it clear. But all of these are actually using the same REST API interface that you can automate against. Basically you can check status, you can dynamically… You can list your volumes, you can even provision your volumes. Even if you want to update the size of your volumes, all of that can be driven through APIs. So at this point, all an admin or a DevOps person would need to do is run a command or use the REST API to basically update the volume’s size. So I’m just going to basically update the size to 100 GB. And this just takes a few seconds. And if I take the FI [33:13] ____, you’ll see that this volume is now 100 GB volume. And this will basically be available to your application right away, so you don’t have to take it down or do any kind of maintenance for this. All right. I think I lost my slides. Yup there it is. All right.