MesosCon 2017: What Building Multiple Scalable DC/OS Deployments Taught Me About Running Stateful Services on DC/OS

We had a great time at MesosCon Europe 2017. Nathan Shimek, VP of Client Solutions at New Context, and Dinesh Israni, Senior Software Engineer at Portworx, gave a presentation there titled “What Building Multiple Scalable DC/OS Deployments Taught Me About Running Stateful Services on DC/OS.”

The transcript of the MesosCon Europe presentation follows and you can watch a video of the presentation below.

 

Nathan Shimek:
Good afternoon. I’m Nathan Shimek, Vice President of Client Solutions at New Context. We are a San Francisco-based systems integrator that specializes in container implementations. I’m joined today by Dinesh Israni, Senior Software Engineer at Portworx. And today we’re here to talk to you about what building multiple scalable DC/OS deployments has taught us about running stateful services on DC/OS.

I would like to take a moment and thank the Linux Foundation for hosting the conference, Mesosphere for developing a great product for us to build on top of, the sponsors, and all of you for showing up at 5:00 on a Friday. So without further ado, let’s dive right in.

The containerization space as it exists today has myriad challenges. For a start, it’s relatively new. So the teams that are today being tasked to build and maintain platforms often just don’t have a huge amount of experience. Similar to the adoption of cloud technologies, there’s a real ramp that comes into learning and successfully building all of these things. As such, sufficient skills and experience are one of the things that you should really look for as you go forward.

There are areas where traditional skills don’t necessarily directly translate but need to be built upon. For example, in the networking arena, the recent addition of CNI, SDN, network overlays, etc. further complicates an already complex picture. So if it’s your expectation that you’re going to go from zero to production in a couple of months with a small team that doesn’t have domain experience, it’s probably going to be pretty challenging.

That said, there is hope. Things are improving very rapidly, and the patterns for success in the space are quickly emerging, and the community is doing a lot to bring those forward.

So today, we’ll talk about four high-level areas and then dive a little bit deeper. First, we’ll look at platform availability overall and some of the key design decisions you should be thinking about to ensure your DC/OS implementation is resilient against failure. Next, we will look at some of the stickier points we both experienced, within and outside the cluster. And finally, we’ll review how organizations respond to these challenges and what has enabled them to find success for running stateful services in DC/OS.

So let’s take a look at platform availability. You’ll see that there’s a huge list of things that can be considered failure domains. Don’t consider these specific to containerization or DC/OS by any means. These are failure domains that you’ve probably seen in an Amazon environment, maybe in your virtualized infrastructure, and quite possibly in your bare metal infrastructure. In our experience, these are scenarios that, given sufficient time and number of users, you are likely to see at some point in your environment. So they happen all the time; what matters is how we design around them and mitigate those risks. At the end of the day, it’s our job to mitigate the impact of these outages and be able to [03:06: sound cut out…].

So when we get things wrong, and we do get things wrong, it can be dangerous and costly. That said, don’t lose hope. These are certainly not insurmountable challenges. It’s something that we’ve really focused on improving over the last couple of years.

When you do have an issue, get in the habit of holding a blameless postmortem, then be sure to include how one could have identified a service interruption through monitoring of metrics, and include that as part of your discussion. Then actually test, or set up and test, the appropriate monitors and ensure that they behave as expected. If you’re unfamiliar with the concept of blameless postmortems, let’s have a quick conversation after this.

 


Actually diving into how to build a resilient platform, we’re of the opinion that you should design for production quality from the start. That doesn’t mean that you have to build a production-level implementation during your POC days, but keeping that goal post further down is going to be really important.

It’s our experience that the difference in effort is comparatively small when you look at the challenges companies face when a POC implementation gets traction and suddenly you’re hosting revenue-generating systems on top of unstable or essentially just small-scale infrastructure. Invariably when that happens, platform stability issues emerge as the platform is just being asked to do things it’s not really designed to do, and users have bad experiences.

Additionally, when you design from the start with production-scale infrastructure in mind, you inform decisions that you’ll make at a later time, as you have a specific lens to work with. Your automation and tool choices are very heavily impacted by the design, and over the medium term, you should actually be able to get further ahead, as your investment in automation from the start, around the ability to build and rebuild clusters easily with minimal impact, will greatly reduce your time to upgrade and allow you to rapidly iterate on your DC/OS infrastructure.

It’s, again, our experience that on small-scale implementations, you can experience more downtime with cluster rebuilds than on a much larger one, due to the typical approach of manual intervention in a POC or small-scale environment versus a heavy focus on automation when you go to scale.

Some key points to think about from your automation efforts: Are your operators safe to terminate at least one node without any measurable impact? Yes? That’s great. What happens if three nodes go down? What if you answered “no” to that? What happens when you lose a node and you’re sitting at a talk at MesosCon at 17:06 on a Friday? Did your monitoring and metrics collection pick it up and automatically resolve that and just open up a ticket to let you know that something happened? Or did a developer who’s reliant upon services provided on top of DC/OS have to open a ticket internally? Or even worse yet, some end user waits 25 minutes, experiencing an outage, opens up a ticket with your company, and then you have SLA impacts. All of these things can by and large be mitigated with a proper design and implementation from the start.
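To make the "one node versus three nodes" question concrete, here is a minimal sketch in Python (not from the talk). It assumes the component in question is a majority-quorum ensemble, which is how ZooKeeper behaves; the function names are hypothetical and the numbers are plain arithmetic, not measurements.

```python
def tolerated_failures(members: int) -> int:
    """Number of members a majority-quorum ensemble can lose and keep quorum."""
    if members < 1:
        raise ValueError("an ensemble needs at least one member")
    return (members - 1) // 2


def survives(members: int, failed: int) -> bool:
    """True if the surviving members still form a strict majority."""
    return members - failed > members // 2


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(
            f"{size} members: tolerates {tolerated_failures(size)} failure(s), "
            f"survives 1 down: {survives(size, 1)}, "
            f"survives 3 down: {survives(size, 3)}"
        )
```

Under this assumption, a three-member ensemble answers "yes" to losing one node, and neither a three-member nor a five-member ensemble answers "yes" to losing three.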

Continuing down that scenario. So now we’ve got some nodes down, a customer has called in after 25 minutes, you’ve been paged out, and now you’ve got to open up your laptop, connect remotely, and take a look at what’s going on. Do you have the ability to bring back the failed nodes with a single command, easily executed, or do you have to actually dig in and do some manual intervention? Again, now we’re adding time.

All these things are relatively easily addressed, especially if you have the skill sets required to really take on the challenges associated with containers and stateful services within them.

Now, I’ll hand it over to Dinesh to talk about stateful services and storage.

Dinesh Israni:
Thanks, Nate. So, in this new age of DevOps tools, typically everything needs to be automated, because no one’s really got the time to log in and manually recover from failures. Also, this is not really possible at large scale, because you don’t want one of your DevOps folks to be up, like Nate said, at 5 pm on a Friday, trying to bring up a thousand nodes that went down and recover your data.

 


You want to make sure that the storage solution you choose has good integration with schedulers. And if you’re using multiple schedulers, you want to make sure that it works across all of them so you don’t need a separate solution for each.

For example, you also want to make sure that you’re able to efficiently schedule pods to be co-located with your data so that you get good performance for your pods or containers and don’t spend a lot of network bandwidth just sending data across your nodes.
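The co-location idea can be sketched as a small placement function. This is an illustration in Python under stated assumptions, not how DC/OS or Portworx actually schedules: `pick_node`, the replica set, and the free-CPU map are all hypothetical inputs.

```python
from typing import Dict, Optional, Set


def pick_node(
    volume_replicas: Set[str],
    free_cpus: Dict[str, float],
    cpus_needed: float,
) -> Optional[str]:
    """Pick a node for a task, preferring nodes that already hold its data.

    volume_replicas: names of nodes that store a replica of the task's volume.
    free_cpus: free CPU per node; nodes without enough CPU are skipped.
    Returns None when no node has enough capacity.
    """
    candidates = [n for n, cpus in free_cpus.items() if cpus >= cpus_needed]
    if not candidates:
        return None
    # Order: data-local nodes first, then most free CPU, then name so the
    # result is deterministic.
    candidates.sort(key=lambda n: (n not in volume_replicas, -free_cpus[n], n))
    return candidates[0]


if __name__ == "__main__":
    free = {"node-a": 4.0, "node-b": 8.0, "node-c": 2.0}
    replicas = {"node-a", "node-c"}
    print(pick_node(replicas, free, 1.0))  # a data-local node is chosen
    print(pick_node(replicas, free, 6.0))  # falls back to a remote node
```

The fallback to a remote node is a design choice in this sketch: the task still runs, at the cost of reading its data over the network.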

On a large scale, you also don’t want to manually provision volumes every time a customer, whether it’s internal or external, needs to spin up new services—because that is just adding another layer of manual intervention, which is just not acceptable in this day of automation.

You also want to make sure that you test various failure scenarios, and how schedulers deal with them with regard to storage, in order to avoid nasty surprises in production. We at Portworx are actually working on an open-source framework called Torpedo, which will help you validate these failure scenarios to avoid just that.

The next thing that you should look at is how easily you are able to add or replace storage nodes and perform maintenance operations, because these are the kinds of operations that could result in downtime for your services. So you want to make sure that any storage solution that you choose minimizes or eliminates this kind of downtime.

For example, if you’re using autoscaling groups with Amazon, you need to figure out how that would affect you. Would the storage from your old nodes automatically be available to your new nodes? And if you wanted to add capacity to your storage solution, are you able to scale up your current nodes, or would you also be able to add new nodes to scale out your cluster?

Another thing to keep in mind is how your services would work in hybrid cloud deployments, because you don’t want to be building tools and automation for each different type of environment that you have. You want to have one way of doing things across multiple environments. So, for that, you want to use a cloud-native storage solution like Portworx to make sure that it is easy to manage and deploy your storage in one way.

 


You don’t have to have multiple automation frameworks and tools to manage different deployments in that way, like Nathan pointed out. Also, you want to aim for highly available data, as Nathan pointed out earlier, because you don’t want to run into production and then figure out that, oh, you lost a node and then you are not able to bring the same services up because you failed to replicate your data.

Another thing that you want to make sure of is that your storage solution is automatically able to place replicas across failure domains, so that you’re always able to bring up your service even if an entire rack goes down. This requires your storage solution to be intelligent enough to figure out where its nodes are located and automatically place data in different availability zones when volumes are provisioned.
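A minimal sketch of failure-domain-aware replica placement, written in Python for illustration. It assumes each node is labeled with a single zone (a rack or an availability zone) and that free capacity per node is known; `place_replicas` and its inputs are hypothetical, not a Portworx API.

```python
from typing import Dict, List


def place_replicas(
    node_zone: Dict[str, str],
    free_gb: Dict[str, int],
    size_gb: int,
    replicas: int,
) -> List[str]:
    """Choose `replicas` nodes so that no two share a failure domain."""
    # Step 1: within each zone, keep the eligible node with the most free space.
    best_per_zone: Dict[str, str] = {}
    for node in sorted(node_zone):
        if free_gb.get(node, 0) < size_gb:
            continue
        zone = node_zone[node]
        current = best_per_zone.get(zone)
        if current is None or free_gb[node] > free_gb[current]:
            best_per_zone[zone] = node
    # Step 2: refuse to place rather than silently stack replicas in one zone.
    if len(best_per_zone) < replicas:
        raise RuntimeError("not enough failure domains with free capacity")
    # Step 3: take the zones whose best node has the most free space.
    ranked = sorted(best_per_zone.values(), key=lambda n: (-free_gb[n], n))
    return sorted(ranked[:replicas])


if __name__ == "__main__":
    zones = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack3"}
    free = {"n1": 500, "n2": 900, "n3": 300, "n4": 50}
    print(place_replicas(zones, free, size_gb=100, replicas=2))
```

Raising an error when there are too few eligible zones is deliberate here: a placement that quietly puts two replicas on the same rack would look healthy until that rack goes down.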

Finally, you want to make sure that when the time does come to upgrade your software solution, you don’t have to bring down your entire cluster. You want to make sure that there is a way to perform in-place rolling upgrades to minimize disruptions. Again, this sometimes requires integrations with schedulers to let them know that your storage is going to be down on a particular node, so that they do not schedule any containers onto that node while the upgrade is in progress.
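The rolling-upgrade flow described here can be sketched as a loop that takes one node out of scheduling at a time. This Python sketch assumes the caller supplies four callbacks (`cordon`, `upgrade`, `healthy`, `uncordon`); those names are hypothetical stand-ins for whatever your scheduler and storage layer actually expose.

```python
from typing import Callable, List


def rolling_upgrade(
    nodes: List[str],
    cordon: Callable[[str], None],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    uncordon: Callable[[str], None],
) -> List[str]:
    """Upgrade nodes one at a time; stop at the first unhealthy node.

    Returns the list of nodes that were upgraded and returned to service.
    """
    done: List[str] = []
    for node in nodes:
        # Tell the scheduler not to place new containers on this node.
        cordon(node)
        upgrade(node)
        if not healthy(node):
            # Leave the node cordoned and halt, so a bad release never
            # reaches the rest of the cluster.
            raise RuntimeError(f"{node} is unhealthy after upgrade; rollout halted")
        uncordon(node)
        done.append(node)
    return done


if __name__ == "__main__":
    events: List[str] = []
    rolling_upgrade(
        ["n1", "n2"],
        cordon=lambda n: events.append(f"cordon {n}"),
        upgrade=lambda n: events.append(f"upgrade {n}"),
        healthy=lambda n: True,
        uncordon=lambda n: events.append(f"uncordon {n}"),
    )
    print(events)
```

Only one node is ever out of service at a time, which is what keeps replicated volumes available during the rollout.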

So, I’m going to hand it back to Nathan now to talk about testing for the failure scenarios that I alluded to.

Nathan Shimek:
Great. Thanks, Dinesh. Testing is key in our world. Today, there are a number of companies, like Portworx with Torpedo and Netflix with Chaos Monkey, that are building and open-sourcing tools that allow you to simulate real-world outages for various services. Ideally, you would eventually mature to actually running these in your production environment, but on the path there, I would suggest building a production-like environment: a minimal-scale implementation that follows the same clustering topology, network topology, etc. as your actual production infrastructure, and running them there. Doing so will likely expose gaps in monitoring, response procedures, response times, or any number of areas that you are going to hit in production. This is just a less costly way to find them and patch them up.
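The core of a failure-injection test can be shown against a toy model of a cluster. The Python sketch below does not use the Torpedo or Chaos Monkey APIs; it assumes a hypothetical map from volume name to the set of nodes holding its replicas, fails random nodes, and reports which volumes would have no surviving replica.

```python
import random
from typing import Dict, List, Set, Tuple


def unavailable_volumes(replicas: Dict[str, Set[str]], down: Set[str]) -> List[str]:
    """Volumes whose every replica lives on a failed node."""
    return sorted(vol for vol, nodes in replicas.items() if nodes <= down)


def run_experiment(
    replicas: Dict[str, Set[str]],
    nodes: Set[str],
    failures: int,
    rng: random.Random,
) -> Tuple[Set[str], List[str]]:
    """Fail `failures` random nodes and return (failed nodes, lost volumes)."""
    down = set(rng.sample(sorted(nodes), failures))
    return down, unavailable_volumes(replicas, down)


if __name__ == "__main__":
    layout = {"db": {"n1", "n2"}, "logs": {"n3"}}
    rng = random.Random(2017)  # seeded so a failing run can be replayed
    for _ in range(3):
        down, lost = run_experiment(layout, {"n1", "n2", "n3"}, 1, rng)
        print(f"failed {sorted(down)} -> unavailable volumes: {lost}")
```

In this layout the single-replica `logs` volume is the one that a one-node failure can take out, which is the kind of gap such a test is meant to surface before production does.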

So, again, develop metrics and monitoring that align to the failure scenarios that you see most commonly and that are most impactful. In the world today, it’s incredibly easy to implement a tool, check a bunch of checkboxes, and get totally inundated with data that isn’t actionable. Really focus on what impacts you and how to respond to that.

Additionally, when things break, and they will, it’s really important to limit the blast radius. The last thing you want to have happen is a cascading failure, which takes significant downtime and effort to recover from. If we started with an HA design and implementation and have focused on automation, we’ve already taken significant steps to reduce the impact of single-zone outages. And we can further reduce the likelihood of a cascading failure by isolating user applications from each other, platform services from users, and platform services from each other.

 


For example, if a platform service needs ZooKeeper, then the ZooKeeper instance of the service that it’s linked to should not be accessible from the platform users. Isolating platform services from the user space will help ensure platform resiliency in the face of application issues. Additionally, sandboxing platform services will help avoid everything from noisy neighbor problems and resource consumption issues to cascading failures.

At the end of the day, infrastructure is multi-disciplinary and cross-functional, and DC/OS is no different. You really need expertise in security, compliance, containers specifically, compute, storage, networking, automation, CI, on and on and on. We’re not yet to the point where we’ve fully converged those skill sets, so find people with experience in that space and bring them in. The days of having a compute team, a network team, and a storage team don’t really align with the model of DC/OS, nor do they really align with modern operating models in general. DevOps has pretty fundamentally changed the space and you should look to a lot of the learnings from there.

 


So now let’s take a look at what’s happening within the cluster. And with that, I’ll hand it back to Dinesh.

Dinesh Israni:
So once you have your cluster up and running, you will realize that your needs will change over time, either because the apps that you use will change, the scale that you run them at will change, or it’s just the ever-evolving tech that you’re involved with. So, in such scenarios, you don’t want to tear down your volumes or cluster and reinstall everything to be able to deal with your new requirements.

For example, if you provisioned a 100 GB volume for an application, but the demand and use for that application far exceeded your expectation and you now need to allocate more space to it: Do you want to provision another volume and move data over from the old volume? You don’t. The ideal way you would want to do this is do it in real time without having any downtime for your services. And you will eventually hit a point where you will need to add more storage to your storage solution. Again, you would want to make sure that the solution that you have chosen allows you to do this seamlessly by either adding disks to nodes or adding new nodes, as I had mentioned earlier.

You also want to make sure that you understand your customers’ needs with regards to backup and archiving data. For example, you want to set up schedules to take regular snapshots automatically, and also archive your data outside your cluster in cases of disaster, so that you can basically recover from that.

For example, with Portworx, you can do this by setting up snapshots scheduled at a container-granular level, and also take cloud snaps which can back up your data to either S3, Azure Blob, or Google Cloud Storage. So in a case of disaster, all you would need to do is basically restore from that cloud snap and reconfigure your apps, and you will be up and running with your service.
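The scheduling and retention logic behind regular snapshots can be sketched in a few lines of Python. This is a generic illustration, not Portworx’s snapshot or cloud snap interface: `snapshot_due`, `split_retained`, the interval, and the retention count are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def snapshot_due(last: Optional[datetime], now: datetime, interval: timedelta) -> bool:
    """A snapshot is due if none exists yet or the interval has elapsed."""
    return last is None or now - last >= interval


def split_retained(
    snapshots: List[datetime], keep: int
) -> Tuple[List[datetime], List[datetime]]:
    """Return (kept, expired), keeping the `keep` most recent snapshots."""
    ordered = sorted(snapshots, reverse=True)
    return ordered[:keep], ordered[keep:]


if __name__ == "__main__":
    now = datetime(2017, 10, 27, 17, 0)
    taken = [now - timedelta(hours=h) for h in (1, 7, 13, 19, 25)]
    print("due:", snapshot_due(max(taken), now, timedelta(hours=6)))
    kept, expired = split_retained(taken, keep=4)
    print("kept:", len(kept), "expired:", len(expired))
```

Expired local snapshots are the natural candidates for archiving outside the cluster before deletion, so that a restore is still possible if the cluster itself is lost.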

 


You also need to understand your security needs based on the service you are running. For example, how is your data stored at rest as well as in transit? Depending on the industry you are in, there might be regulations, and you want to make sure that you can enable encryption for both these cases.

Lastly, you also want to make sure that you can monitor the health of your storage solution and receive alerts in case of impending doom, so that you can proactively take measures to avoid downtime. And today with tools like Prometheus and Grafana, there is really no excuse for storage solutions not to provide such integrations.
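One way to turn "impending doom" into an alert is to extrapolate disk usage and warn when the projected time until full drops below a threshold. The Python sketch below uses a plain least-squares line over hypothetical `(timestamp, used_bytes)` samples; PromQL’s `predict_linear` function offers a comparable linear extrapolation if your metrics already live in Prometheus.

```python
from typing import List, Optional, Tuple


def seconds_until_full(
    samples: List[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches capacity.

    samples: (unix_timestamp, used_bytes) pairs in time order.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope (bytes per second) and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity - intercept) / slope
    return max(0.0, full_at - samples[-1][0])


def should_alert(samples: List[Tuple[float, float]], capacity: float,
                 horizon_seconds: float) -> bool:
    """Alert when the volume is projected to fill within the horizon."""
    eta = seconds_until_full(samples, capacity)
    return eta is not None and eta <= horizon_seconds


if __name__ == "__main__":
    usage = [(0.0, 10.0), (60.0, 20.0), (120.0, 30.0)]
    print(seconds_until_full(usage, capacity=100.0))
    print(should_alert(usage, capacity=100.0, horizon_seconds=600.0))
```

A projection like this is only as good as the assumption that growth stays roughly linear, so the horizon should leave enough time to add capacity by hand if the estimate turns out optimistic.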

So I’m going to hand it back over to Nathan to talk about some of the platform security stuff that you can tackle.

Nathan Shimek:
Security within the containerization realm, and security in general, is a much broader and deeper topic than we’ll have time to really go into today, but I figured we’d just do a couple of quick hits. Patterns for both attacking and defending containers are evolving rapidly. There are several open-source software initiatives creating patterns to attempt to address the space, but the bulk of the progress really has been made in the enterprise software realm.

There are a few things that you can do today, probably with relatively low cost, by either deploying new tools or tweaking existing tools to take advantage of some of the security improvements.

Most people here are probably aware of the CIS Docker benchmark. I think it provides value. It’s one of the things that we integrate as part of CI on a very, very regular basis. Additionally, you can look at container image signing, real-time vulnerability scanning, and compliance control enforcement and monitoring through something like InSpec and test-driven development. Again, I’m happy to talk about all these things over beers afterwards, but each one of these things probably warrants a multi-day track, so we’ll just skip through that a little bit quickly.

Now, to the fun part: really operationalizing things. At the end of the day, you will always need to maintain what you build. And maintenance encompasses version-to-version upgrades, major upgrades, accommodating breaking changes, etc.

 


Cluster maintenance and upgrades have become significantly easier in the DC/OS world. And if you’ve been using it for 18 months or more, you know this. And based off of my kind of rough understanding of the product roadmap for DC/OS, there are some significant improvements in 1.11. I’m sure that the people out at the Mesosphere booth would be happy to run you through the product roadmap on that front.

Even with good controls and training, users will still find a way to break things, lock up resources, and otherwise just cause havoc in the cluster. Occasional frozen jobs, runaway ops, and open tasks just happen. It’s the name of the game. Planning ahead for these issues will really make your life much easier.

Now, let’s take a look at how we handle some of the challenges with externalizing services, which are built and running in DC/OS. As I kind of alluded to earlier, networking is one of the more tricky areas given the additional complexities added by network overlays, CNI, SDN, etc.

Clusters today are really well-designed for internal traffic. Apps talking to apps in the cluster is highly reliable, well understood, and overall pretty trivial. The real challenges, in our experience, come from when you need to wire in to existing infrastructure and externalize a service. For example, do you have an IPAM tool in place today in your company? Does it provide an API that is easy to automate against? If not, do you have to adopt a new tool for your company to do IP address management, or do you carve out some subset just for your containerization environment? When you start to add in things like IP per container, this conversation becomes much more complex.
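The "carve out some subset just for your containerization environment" option can be sketched with Python’s standard `ipaddress` module. The address ranges below are made-up examples, and `carve_subnets` is a hypothetical helper; a real IPAM tool would be the source of truth for which ranges are already taken.

```python
import ipaddress
from typing import List


def carve_subnets(parent: str, new_prefix: int, reserved: List[str]) -> List[str]:
    """List subnets of `parent` at `new_prefix` that overlap no reserved range."""
    parent_net = ipaddress.ip_network(parent)
    taken = [ipaddress.ip_network(r) for r in reserved]
    free = []
    for subnet in parent_net.subnets(new_prefix=new_prefix):
        if not any(subnet.overlaps(t) for t in taken):
            free.append(str(subnet))
    return free


if __name__ == "__main__":
    # Assumed example: a /22 handed to the container platform, with one /24
    # already allocated to existing infrastructure.
    for net in carve_subnets("10.20.0.0/22", 24, reserved=["10.20.1.0/24"]):
        print(net)
```

With IP per container, each of these subnets fills quickly, which is one reason the sizing conversation gets more complex than it is for host-level addressing.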

So in this realm, as well as with service discovery and load balancing, what you have in place today is going to largely inform what you do with your DC/OS implementation. I’m sure we all have opinions on what you would want to do in a greenfield environment, when it comes to load balancing, service discovery, etc.

It’s my experience that I haven’t been particularly lucky as a consultant in the ability to start from a greenfield. If you have those projects going, that’s really cool and I’d love to talk to you about those and hear what your thoughts are. But again, for me, the name of the game here is really how do we integrate into existing environments, and then move stateful services that are today running on either bare metal or in a virtualized environment into containers.

So last, but certainly not least, let’s talk about the organizational side. As I’ve alluded to, I actually believe that this is kind of the chief metric on whether or not you’re going to be successful in your DC/OS implementation, and I would say containerization in general. The team who leads the internal container initiative really will define its success or lack thereof. It’s our experience that they need to bring themselves, their peers, and the internal developer community up to speed on all of these new technologies, patterns, etc., pretty much in parallel. And that’s a pretty difficult task.

As such, one of the ways to kind of ease the burden here is to engage people early, probably your developer community first off, and really get to understand what their requirements are. At the end of the day, you’re building a platform for services, and if you’re not providing services that are consumable or of interest to the people building software on top of that, what are you doing it for?

 


So I’m a strong proponent of thinking of this as a software project more so than a traditional infrastructure project. I always handle this in an agile fashion, do some requirements gathering, work very rapidly and iteratively to provide value as quickly as possible. That way, assuming they have a good experience, assuming the platform’s available and resilient and provides services that people are interested in, they’re typically going to use it. And then once you have adoption, hopefully you can turn those people into evangelists to turn other people in your community on to the platform you’re now providing.

Too often, I certainly see a small-scale implementation that doesn’t really look at what they’re trying to service and they don’t get adoption and wonder why. It’s not one of those things that if you build it, they will come. It’s if you make it easy to use and attractive, they might use it. But certainly if you just make it difficult to use or are not providing any value, they’re not going to engage with you.

As Dinesh touched on, there are a number of guardrails that do need to be built, especially when it comes to reasoning about data services. At the end of the day, there are data sovereignty laws, which can be as granular as the local level and certainly exist at the state and national level. As Dinesh pointed out, that can be something as simple as encryption.

Or if you’re in a multi-region implementation and accidentally decide or purposely decide to replicate personally identifiable information, for example out of the European Union into the United States or vice versa, you have now really gone off the rails, and your internal controls and compliance organization is not going to be happy with you. Unfortunately, it’s trivially easy to do that from a technology perspective, and it can have huge ramifications on your company from a legal perspective. That’s not a great conversation anyone wants to have with your CIO or your internal legal general counsel.
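A guardrail against this kind of accidental cross-border replication can be a simple pre-deployment check. The Python sketch below is an assumption-laden illustration: the `ReplicationRule` shape, the PII flag, and the region-to-jurisdiction mapping are all hypothetical, and the actual policy has to come from your compliance organization.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReplicationRule:
    volume: str
    contains_pii: bool
    source_region: str
    target_region: str


def violations(
    rules: List[ReplicationRule], jurisdiction: Dict[str, str]
) -> List[str]:
    """Volumes holding PII that would replicate across jurisdictions.

    Regions missing from the mapping are treated as violations, so an
    unknown destination is rejected rather than waved through.
    """
    flagged = []
    for rule in rules:
        if not rule.contains_pii:
            continue
        src = jurisdiction.get(rule.source_region)
        dst = jurisdiction.get(rule.target_region)
        if src is None or dst is None or src != dst:
            flagged.append(rule.volume)
    return flagged


if __name__ == "__main__":
    mapping = {"eu-west-1": "EU", "eu-central-1": "EU", "us-east-1": "US"}
    rules = [
        ReplicationRule("users", True, "eu-west-1", "us-east-1"),
        ReplicationRule("orders", True, "eu-west-1", "eu-central-1"),
        ReplicationRule("build-cache", False, "eu-west-1", "us-east-1"),
    ]
    print(violations(rules, mapping))
```

Running a check like this in CI, before a replication policy is applied, keeps the mistake from being as trivially easy to make as it is to configure.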

So, engage with your internal controls group to see what you can do and what you should be doing. Additionally, be mindful of any industry-specific controls. In the United States, we have HIPAA for healthcare, we have SOC, and a number of others. They define what we need to do and what our responsibilities are, and regardless of the platform that we’re delivering our services upon, it is our responsibility to meet them. So, at the end of the day, have some conversations, stay in compliance, make everybody’s lives easier.

On the skill-set side, there are some ways to engender growth and adoption. First and foremost, if you have to do some external recruiting, find some experienced engineers who have run through this or worked on the platform. They will be a great asset to you. If you don’t have that, not a problem, there’s a wide community. There are Slack channels and GitHub and a million places, including conferences like the one you’re at now, to find people, engage with them, and learn from them.

Additionally, especially early on, I’m a big fan of creating an operational playground, both for the platform engineering side as well as the developer side. It’s my experience that I need to be able to figure out how to break a cluster, break clusters unintentionally, and rapidly iterate on automation to rebuild things. And if I’m doing that while developers are attempting to learn how to use the platform, I’m negatively impacting their experience.

So if you have enough compute resources available, just give yourself your own playground, give dev their own playground, and eventually you can probably get to the point where you’re mature enough that you can consolidate those things. But off the bat, I would definitely start there.

Additionally, just general notes from agile: fail fast and fail often. It’s totally okay. This is a learning experience for many of us.

And finally, if you want to make this more attractive and drive some internal adoption, set up a hack day. Figure out what makes sense and what problems you are trying to solve.

A great way to help on all these fronts is to focus on training. After you’ve found some internal advocates and evangelists who are familiar with the platform and can drive excitement within your organization, and once you have a base level of engagement coupled with providing developers and engineers with environments where they can learn, you should be able to rapidly iterate and experiment to drive adoption.

Somewhere around this point, I would suggest investing in formalized training and then using that experience to build training that actually matters to you. At the end of the day, stateful services is a rapidly expanding field. So what makes sense for a company looking at implementing Cassandra might not be the same as for a company implementing something else. Really, find out how to bring up the skill sets that are applicable to your organization.

At the end of the day, running stateful services in containers is not trivial. DC/OS and Portworx are making it significantly easier, but it still requires expertise in a wide range of areas to do that successfully. And finding experienced advocates and evangelists in your organization will really help.

 


So by pulling all these things together, you should find that you’re fostering the right skills to make your platform attractive and available, and eventually, or hopefully soon, getting to production-level services. So, with that, we’re pretty much done. Any questions?

Okay. Thank you very much.