May 17, 2023
Drive Simplicity with a Kubernetes Data Platform
Kubernetes data platforms provide a simple way to manage complex tasks at scale. Platform Engineers have become the glue that holds technical operations together. They need to provide a seamless experience for developers to get their product into production, all while ensuring that deployed applications meet our company’s SLAs.
Organizations want their developers to be focused on their code rather than figuring out how to deploy everything. This job is increasingly difficult because of the way modern applications are created. Instead of deploying a few virtual machines that run your application and database, we’re now dealing with many databases per application, and the applications are broken into individual microservices. This is fantastic for developers because they have a lot of choices for how their apps are built, but all of these options add a lot of complexity for our Platform Engineers to manage, and complexity can cause a lot of problems. Complex systems require more thought to add new services or maintain older ones, and that slows innovation. They require more difficult operational procedures to orchestrate all the tasks, and finding the root cause of issues when things break can be very time-consuming.
Kubernetes has played a big role in reducing complexity at scale. Yes, I said Kubernetes REDUCES complexity. I know that might sound counterintuitive to many of you trying to learn Kubernetes for the first time, but if you had ever tried to manage microservices at scale using just containers, you would know how important Kubernetes as a platform is to reducing this complexity. Kubernetes provides a consistent application platform that can be integrated with cloud providers to provide a consistent experience—no matter what cloud it runs on. Kubernetes takes all these options and provides us with some consistency, which limits the number of variables that Platform Engineers need to manage.
Kubernetes has been a great solution as an application platform, and it gets better with each release. But it’s missing some key features for running stateful applications in this same simplicity model. Platform Engineers are still struggling with multiple tools for handling backups, disaster recovery, self-service, and a myriad of other features that are required for an enterprise to deploy an application into production. Platform Engineers need a data platform to go with their Kubernetes-based application.
Kubernetes clusters are not just for stateless applications anymore; they are brimming with data. We store application data, cache data, tier data, replicated data, backup data, and sometimes archive data. Our applications are storing data in block storage, file storage, object storage, message queues, streams, relational databases, and non-relational databases. A single microservice-based application often consumes many of these solutions at once! Not only are our applications collecting more data than ever before, but they are also generating more data than ever before, including log data and telemetry data in addition to performance metrics.
Storing this data is a challenge in and of itself, but enterprises will also require data services on this data to keep it flexible, secure, protected and available. With all of the complexity of managing data at scale in a modern application environment, our Platform Engineers need a data platform that seamlessly integrates with their application platform.
Characteristics of a Data Platform
Kubernetes provides a platform to consistently run applications in a standardized way. The clusters provide a base set of services that you can depend on to run your applications. You know that a scheduler will place your pod on the right node. You know that Kubernetes services will do service discovery and load balancing for your applications. You know that you have a secret store that you can use to store credentials for your applications.
Similarly, a Kubernetes data platform provides a common set of necessary services for managing data. In short, it solves management problems for Platform Engineers. Data services running on Kubernetes have very similar needs to bare metal or virtualized applications. As you look through this list of data platform capabilities, imagine trying to push a legacy application running on bare metal or a virtual machine into production without some of these capabilities.
Storage Infrastructure: The most basic capability a data platform should provide is storage infrastructure. This is the system that will store your data for any of the applications that you are running. The storage infrastructure should have data capacity that can be presented to your applications through file storage, block storage, and/or object storage. A data platform must provide a way for your applications to access this capacity.
Application Availability: Kubernetes data platforms need to ensure that data will be available during a failure or upgrade. Hardware fails, software panics, and anything that can go wrong will go wrong. Kubernetes can reschedule pods across nodes, but can your Kubernetes platform make sure your data is also available across nodes? A production cluster needs a way to keep applications that depend on state highly available when a node or failure domain goes offline. Just because your applications can fail over between zones does not necessarily mean that your data can as well. Be sure your data platform provides zone awareness to keep your applications available at all times.
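To make zone awareness a little more concrete, here is a minimal Python sketch of spreading volume replicas across failure domains. The node names, zone names, and the `place_replicas` and `survives_zone_loss` helpers are all hypothetical and exist only for illustration; a real data platform makes this decision internally and weighs many more factors, such as capacity and load.

```python
from itertools import zip_longest


def place_replicas(nodes_by_zone, replicas):
    """Pick `replicas` nodes, spreading them across zones round-robin.

    `nodes_by_zone` maps a zone name to a list of node names (made-up data).
    """
    total = sum(len(nodes) for nodes in nodes_by_zone.values())
    if replicas > total:
        raise ValueError("not enough nodes for the requested replica count")

    # Interleave the zones so consecutive picks land in different zones.
    columns = zip_longest(*(nodes_by_zone[zone] for zone in sorted(nodes_by_zone)))
    ordered = [node for column in columns for node in column if node is not None]
    return ordered[:replicas]


def survives_zone_loss(placement, node_zone, lost_zone):
    """Return True if at least one replica lives outside the lost zone."""
    return any(node_zone[node] != lost_zone for node in placement)


# Hypothetical three-zone cluster.
cluster = {
    "zone-a": ["node-1", "node-2"],
    "zone-b": ["node-3"],
    "zone-c": ["node-4", "node-5"],
}
node_zone = {node: zone for zone, nodes in cluster.items() for node in nodes}

placement = place_replicas(cluster, replicas=3)
for zone in cluster:
    print(zone, "lost, data still available:", survives_zone_loss(placement, node_zone, zone))
```

With three replicas and three zones, each replica lands in a different zone, so losing any single zone still leaves two copies of the data reachable.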
Disaster Recovery: High availability is our first line of defense for failures, but some failures are bigger than others. When the unthinkable occurs, you need a reliable way to restore services. Applications can be redeployed from version control to new Kubernetes clusters, but persistent data cannot. Data platforms provide a way to replicate data over long distances to protect against regional outages. The most mature data platforms provide zero data loss solutions for the most critical applications that cannot accept dropped transactions.
Data Protection: Our data might be one of the company’s most precious assets. It can be a target for ransomware attacks, or it could simply be lost to normal bit rot. When we consider the criticality of the data to our company and factor in the risks of losing it, we can see that data platforms must provide a mechanism to make copies of our data in case we need it later. Snapshots are an efficient way to protect our data from corruption, and backups provide a way to archive our data for longer-term retention. Because ransomware attacks can target the backups themselves, it is important to consider immutable backups for critical data. A data platform should protect data against both accidental loss and ransomware by providing immutable backups.
Resource Utilization: Data platforms have to do a lot to work seamlessly. Like a duck on a pond, it might seem like a data platform is effortlessly providing data management capabilities, but under the surface there is a lot of activity. Scheduling snapshots and backups, managing replication, providing self-service, and scaling storage all take resources. A good data platform can provide these capabilities without requiring too many resources of its own. A good data platform should save you more time and resources than it takes to run your data services.
Data Portability: Containers made it really simple for us to move our applications around. Containers are lightweight, encapsulated environments that include all the dependencies and libraries to run an application. A data platform should provide ways for us to also move the data for these applications to where it delivers the greatest benefits to our organization. Without having data portability, our applications are really tethered to where the data is located and become less portable. Making data portable allows users to copy data between production and development environments for testing, move applications and data between clouds, and just be generally flexible for changing business requirements.
Security, Encryption and RBAC: Our customers who provide personal or confidential information expect their data to be handled responsibly. They have entrusted the data platform with that sensitive information, and it needs to prevent access by curious eyes and accidental dissemination of that data to people who shouldn’t have it.
Role-based access controls are a first line of defense to prevent bad actors from getting sensitive data. The fewer users who have access to data, the fewer attack vectors bad actors can use to reach it. In addition, most employees don’t need access to this sensitive data, so limiting the number of eyes that can see our data is a great start to protecting it. Mature data platforms even provide multi-tenant-capable solutions that can segment access between teams or organizations.
Encryption is another measure that helps protect sensitive data from unauthorized access, theft, and exploitation. When data is encrypted, it is converted into a format that is unreadable and meaningless without the correct decryption key. Data platforms should make the encryption and decryption of data seamless to maintain normal operations while always protecting it. Encryption also comes in two flavors: cluster-wide encryption, or volume-based encryption if tighter security is desired.
Capacity Management: Data always grows. The one thing we can be pretty sure of in this world is that today is the day with the least amount of data we will be managing. Tomorrow there will surely be more. We constantly need to ensure that our data services have enough capacity and resources to continue operating without discarding data. With the number of stateful data applications growing so rapidly, we need better tools to manage their capacity and grow our data storage when necessary. Data platforms can leverage automation to scale our data services without application downtime and without user intervention. The lack of capacity management capabilities can lead to higher storage costs due to overprovisioning. You can leverage a good Kubernetes data platform solution to reduce your overall storage costs.
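As a rough illustration of automated capacity management, the Python sketch below grows a volume once usage crosses a threshold. The `next_volume_size` function and its default threshold, growth factor, and ceiling are assumptions made up for this example and do not come from any particular product.

```python
def next_volume_size(size_gib, used_gib, threshold=0.8, growth=0.5, max_gib=1024):
    """Return the volume size in GiB after applying a simple autogrow rule.

    All defaults are illustrative assumptions:
    - threshold: usage fraction at or above which the volume grows
    - growth:    fraction of the current size to add (0.5 means +50%)
    - max_gib:   ceiling that automatic growth never exceeds
    """
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    if used_gib / size_gib < threshold:
        return size_gib  # Enough headroom; leave the volume alone.
    grown = int(size_gib * (1 + growth))
    # Cap growth at the ceiling, but never shrink an existing volume.
    return max(size_gib, min(grown, max_gib))


for size, used in [(100, 50), (100, 85), (800, 790)]:
    print(f"{size} GiB volume with {used} GiB used -> {next_volume_size(size, used)} GiB")
```

Starting small and growing on demand like this is what lets administrators avoid allocating large volumes up front, which is where the overprovisioning costs come from.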
Developer Agility: We should not run Kubernetes data platforms on service tickets. We need data services at deployment time, whether to test out an idea or to get new services online. We need them right now, and the more queues we have to wait in, the longer it will take for our ideas to get out to our customers. Kubernetes provides the APIs to request services, and a data platform should be able to provide storage services based on those requests. A Kubernetes data platform commonly provides block, file, and/or object services, which can be accessed using standard storage classes. Administrators should have the ability to standardize what platform capabilities are available through Kubernetes storage classes so guard rails can be put around how storage is accessed and used. Developers should be able to use these storage classes to consume the data platform in a self-service fashion when they need it, without breaking the environment and while staying within company standards.
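The split between administrator-defined storage classes and developer self-service can be sketched as follows. This Python example builds a Kubernetes `StorageClass` and a `PersistentVolumeClaim` as plain dictionaries and prints them as JSON, which `kubectl` accepts alongside YAML. The provisioner name `csi.example.com`, the `replicas` parameter, and the object names are placeholders; real provisioner names and parameter keys depend on the storage driver in use.

```python
import json


def storage_class(name, provisioner, parameters):
    """Build a StorageClass manifest (the administrator's guard rail)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,   # Driver-specific; placeholder below.
        "parameters": parameters,     # Driver-specific string key/value pairs.
        "allowVolumeExpansion": True,
        "reclaimPolicy": "Delete",
    }


def volume_claim(name, storage_class_name, size):
    """Build a PersistentVolumeClaim (the developer's self-service request)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names and parameters, for illustration only.
sc = storage_class("replicated-block", "csi.example.com", {"replicas": "3"})
pvc = volume_claim("orders-db-data", sc["metadata"]["name"], "20Gi")

print(json.dumps(sc, indent=2))
print(json.dumps(pvc, indent=2))
```

The developer only references the class by name and states a size; how the volume is replicated or where it lives stays under the administrator's control in the class definition.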
Storage Performance: A data platform has to provide great storage performance. Secure, reliable, highly available data that is slow to access is still useless to us. A highly performant storage solution is one of the most common expectations that we can place on a data platform. The performance of a storage system can have a significant impact on customer experiences, and we should treat it accordingly. Data platforms need to take advantage of the speed of newer technologies like NVMe storage devices for the most extreme performance needs.
Data platforms should be flexible to accommodate the different types of I/O profiles the applications will be running on them. Consistent performance for many types of applications in a single solution reduces the management burden put on administrators.
Cost Management: There is always a trade-off between performance, stability, reliability, availability, and the costs associated with these capabilities. Not every application needs to run on the fastest storage devices we have available. Not every application needs to have storage capacities over-allocated by 50%. Not every application needs to have three copies of data spread across availability zones. A data platform should, of course, provide these capabilities, but a good data platform provides options for applications that do not need all of these features and can save costs instead. Thin-provisioning data stores and growing them as needed can save costs for bulk storage. Providing multiple tiers of disk can save costs, instead of always using premium storage. Replicating non-essential data across availability zones can be costly, so a data platform should provide options to pick and choose when to provide this availability based on the application.
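A quick back-of-the-envelope calculation shows how these choices compound. The Python sketch below compares a volume with every feature turned on against a lean one; the `monthly_cost` helper and the per-GiB prices are made-up numbers used purely for illustration, not real cloud pricing.

```python
def monthly_cost(used_gib, price_per_gib, replicas=1, overprovision=0.0):
    """Estimate the monthly cost of storing `used_gib` of data.

    `overprovision` is extra allocated-but-unused capacity as a fraction
    (0.5 means 50% more is allocated than is actually used).
    """
    allocated_gib = used_gib * (1 + overprovision)
    return allocated_gib * replicas * price_per_gib


# Hypothetical prices per GiB per month.
PREMIUM_PRICE = 0.20
STANDARD_PRICE = 0.05

# 1000 GiB on premium disks, over-allocated by 50%, three copies across zones.
full_featured = monthly_cost(1000, PREMIUM_PRICE, replicas=3, overprovision=0.5)

# The same 1000 GiB, thin-provisioned on a standard tier with a single copy.
lean = monthly_cost(1000, STANDARD_PRICE)

print(f"full featured: {full_featured:.2f} per month")
print(f"lean:          {lean:.2f} per month")
```

With these invented prices the fully featured configuration costs roughly eighteen times as much as the lean one, which is why being able to choose per application matters.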
Reduce Complexity: Lastly, a data services platform should reduce the overall complexity of your environment. There are a lot of data services that you can purchase for individual projects. You might choose a managed cloud database for one app, a cloud file sharing solution for another app, and a self-managed database for a third app. Every time you add a new solution, it requires additional overhead in the form of access controls, cost governance, capacity management, auditing, lifecycle management, etc. Humans—who probably already have too much work to do—usually manage these tasks. Adding bespoke data services for each of our applications makes this a complex set of tasks for a company to manage, as well as increasing costs for training and personnel. A Kubernetes data platform should offer solutions to simplify these tasks for administrators by centralizing the solutions and making them self-service where possible.
Simplify with Portworx
Kubernetes data platforms aren’t too different from your Kubernetes application platform. They provide a set of common services that teams can depend on to reduce the management and overhead of storing data. Their goal is to simplify the task of securing, storing, accessing, and protecting data so that users (or developers) have a better experience when working with data in your organization. Removing these burdens from Platform Engineers gives them the opportunity to work on larger business problems rather than the toil of managing data requirements. If you’re evaluating a data services platform, be sure to ask if they can provide the functionality that we’ve discussed here today.