Kubernetes does a great job of orchestrating your containerized applications and deploying them across the worker nodes in a cluster. If a node has enough compute capacity to run a specific pod, Kubernetes will schedule that pod on that worker node. But what if none of the worker nodes in the cluster have enough available capacity to accept new application pods? At that point, Kubernetes cannot deploy your application pods, and they will be stuck in a Pending state. Beyond this scenario, Kubernetes also has no built-in capability to monitor and manage storage utilization in your cluster. These are two significant problems when it comes to running applications on Kubernetes, and this blog covers how Portworx and AWS can help users architect a solution that remediates both concerns.
In this blog, we will look at how Portworx Autopilot and AWS Karpenter work together on top of AWS EKS clusters: Portworx Autopilot automatically expands persistent volumes or adds storage capacity to the cluster, while AWS Karpenter adds CPU and memory resources by dynamically provisioning more worker nodes in the EKS cluster.
We can begin by developing a better understanding of automated storage capacity management with Portworx Autopilot. Autopilot is a rule-based engine that responds to changes from a monitoring source. Autopilot lets you specify monitoring conditions as well as the actions it should take when those conditions occur, which means you can set simple IFTTT-style rules against your EKS cluster and have Autopilot automatically perform an action for you whenever a condition is met. Portworx Autopilot supports the following three use cases:
- Automatically resizing PVCs when they are running out of capacity
- Scaling Portworx storage pools to accommodate increasing usage
- Rebalancing volumes across Portworx storage pools when they become unbalanced
To get started with Portworx Autopilot, first you will have to deploy Portworx on your Amazon EKS cluster and configure Prometheus and Grafana for monitoring. Once you have that up and running, use the following steps to configure Autopilot and create an Autopilot rule that will monitor the capacity utilization of a persistent volume and scale it up accordingly:
- Use the following YAML file to deploy Portworx Autopilot on your Amazon EKS cluster. Verify that the Prometheus endpoint set in the autopilot-config ConfigMap matches the Prometheus service endpoint in your cluster.
```yaml
# SOURCE: https://install.portworx.com/?comp=autopilot
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: autopilot-config
  namespace: kube-system
data:
  config.yaml: |-
    providers:
      - name: default
        type: prometheus
        params: url=http://px-prometheus:9090
    min_poll_interval: 2
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: autopilot-account
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  labels:
    tier: control-plane
  name: autopilot
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: autopilot
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  replicas: 1
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: autopilot
        tier: control-plane
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "name"
                    operator: In
                    values:
                      - autopilot
              topologyKey: "kubernetes.io/hostname"
      hostPID: false
      containers:
        - command:
            - /autopilot
            - -f
            - ./etc/config/config.yaml
            - -log-level
            - debug
          imagePullPolicy: Always
          image: portworx/autopilot:1.3.1
          resources:
            requests:
              cpu: '0.1'
          securityContext:
            privileged: false
          name: autopilot
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
      serviceAccountName: autopilot-account
      volumes:
        - name: config-volume
          configMap:
            name: autopilot-config
            items:
              - key: config.yaml
                path: config.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: autopilot
  namespace: kube-system
  labels:
    name: autopilot-service
spec:
  ports:
    - name: autopilot
      protocol: TCP
      port: 9628
  selector:
    name: autopilot
    tier: control-plane
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: autopilot-role
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: autopilot-role-binding
subjects:
  - kind: ServiceAccount
    name: autopilot-account
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: autopilot-role
  apiGroup: rbac.authorization.k8s.io
```
- Once you apply this configuration, you can start creating AutopilotRules for individual applications or namespaces. An AutopilotRule has four main sections:
- Selector: Matches labels on the objects that the rule should monitor.
- Namespace Selector: Matches labels on the Kubernetes namespaces the rule should monitor. This is optional, and the default is all namespaces.
- Conditions: These are the metrics for the objects to monitor.
- Actions: These are what Autopilot will perform once the metric conditions are met.
Here is an example of an AutopilotRule that checks for a persistent volume with a label of app: postgres deployed in a namespace with a label of type: db. If the used capacity exceeds 50%, it doubles the size of the persistent volume until it hits a maximum size of 400Gi.
```yaml
apiVersion: autopilot.libopenstorage.org/v1alpha1
kind: AutopilotRule
metadata:
  name: volume-resize
spec:
  selector:
    matchLabels:
      app: postgres
  namespaceSelector:
    matchLabels:
      type: db
  conditions:
    expressions:
      - key: "100 * (px_volume_usage_bytes / px_volume_capacity_bytes)"
        operator: Gt
        values:
          - "50"
  actions:
    - name: openstorage.io.action.volume/resize
      params:
        scalepercentage: "100"
        maxsize: "400Gi"
```
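To make the resize semantics concrete, here is a small Python sketch that models how this rule behaves. This is a hypothetical illustration, not Autopilot code; Autopilot itself evaluates these conditions against Prometheus metrics. Each time used capacity exceeds the 50% threshold, the volume grows by scalepercentage (100%, so it doubles) until maxsize caps it at 400Gi:

```python
# Hypothetical model of the volume-resize rule above, for illustration only.

def next_volume_size(capacity_gib, used_gib,
                     threshold_pct=50, scale_pct=100, max_gib=400):
    """Return the volume size after one Autopilot evaluation pass."""
    usage_pct = 100 * used_gib / capacity_gib
    if usage_pct <= threshold_pct or capacity_gib >= max_gib:
        return capacity_gib  # condition not met, or already at maxsize
    grown = capacity_gib * (1 + scale_pct / 100)
    return min(grown, max_gib)  # never grow past maxsize

# A 100Gi volume at 60% usage doubles to 200Gi ...
assert next_volume_size(100, 60) == 200
# ... while 40% usage leaves it untouched ...
assert next_volume_size(100, 40) == 100
# ... and a 300Gi volume at 60% is capped at 400Gi, not 600Gi.
assert next_volume_size(300, 180) == 400
```

Note that because the resize only fires above the threshold, a volume can briefly sit above 50% utilization between Prometheus polling intervals before the expansion lands.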
Once you apply this AutopilotRule specification, Portworx will monitor the capacity utilization using Prometheus metrics for that persistent volume and automatically perform actions as needed.
- In addition to expanding individual persistent volumes, Portworx Autopilot also allows you to put AutopilotRules in place that direct Portworx to automatically expand the underlying storage pool. This is useful when your applications are storage intensive and you don't want to add more EKS worker nodes to your cluster. Below is a sample AutopilotRule that monitors your storage pool utilization: if the available capacity falls below 50% and the total capacity is still less than 2TB, it automatically creates and attaches EBS volumes to your EKS worker nodes to expand the storage pool's capacity by 50%:
```yaml
apiVersion: autopilot.libopenstorage.org/v1alpha1
kind: AutopilotRule
metadata:
  name: pool-expand
spec:
  enforcement: required
  ##### conditions are the symptoms to evaluate. All conditions are AND'ed
  conditions:
    expressions:
      # pool available capacity less than 50%
      - key: "100 * (px_pool_stats_available_bytes / px_pool_stats_total_bytes)"
        operator: Lt
        values:
          - "50"
      # pool total capacity should not exceed 2TB
      - key: "px_pool_stats_total_bytes/(1024*1024*1024)"
        operator: Lt
        values:
          - "2000"
  ##### action to perform when condition is true
  actions:
    - name: "openstorage.io.action.storagepool/expand"
      params:
        # resize pool by scalepercentage of current size
        scalepercentage: "50"
        # when scaling, add disks to the pool
        scaletype: "add-disk"
```
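Since all conditions in a rule are AND'ed, the pool only expands when both expressions hold at the same time. The following Python sketch (a hypothetical helper, not part of Autopilot, which evaluates the real px_pool_stats_* metrics via Prometheus) illustrates that gating logic:

```python
# Hypothetical model of the pool-expand rule's AND'ed conditions.

GIB = 1024 ** 3  # bytes per GiB, matching the rule's unit conversion

def should_expand_pool(available_bytes, total_bytes, max_total_gib=2000):
    """Both expressions must hold: low free space AND pool under the cap."""
    available_pct = 100 * available_bytes / total_bytes
    total_gib = total_bytes / GIB
    return available_pct < 50 and total_gib < max_total_gib

# A 400Gi pool with only 100Gi free (25% available) triggers expansion ...
assert should_expand_pool(100 * GIB, 400 * GIB) is True
# ... but 75% available does not ...
assert should_expand_pool(300 * GIB, 400 * GIB) is False
# ... and a pool already at the ~2TB cap is never expanded further.
assert should_expand_pool(500 * GIB, 2048 * GIB) is False
```

The second condition acts as a cost guardrail: without it, a steadily filling pool would keep accreting EBS volumes indefinitely.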
Now that you know how to automate storage capacity management, we can look at how we can leverage AWS Karpenter to add more nodes to our EKS cluster when we need more compute capacity for our application pods.
- Use the eksctl instructions on Karpenter's documentation site to deploy an EKS cluster. For our testing, we used the following eksctl configuration to deploy an EKS cluster with two node groups. The first node group, "storage-nodes," includes three nodes that Portworx will use to provide storage for your stateful applications. The second node group, "kar-bshah-ng," will be used by AWS Karpenter to dynamically add more nodes to the EKS cluster, increasing the compute capacity available to your applications.
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kar-bshah
  region: us-west-2
  version: "1.21"
  tags:
    karpenter.sh/discovery: kar-bshah
managedNodeGroups:
  - name: storage-nodes
    instanceType: m5.xlarge
    minSize: 3
    maxSize: 3
    desiredCapacity: 3
    volumeSize: 100
    amiFamily: AmazonLinux2
    labels: {role: worker, "portworx.io/node-type": "storage"}
    tags:
      nodegroup-role: worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - arn:aws:iam::<<aws-account-id>>:policy/<<px-role>>
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
  - name: kar-bshah-ng
    instanceType: m5.large
    amiFamily: AmazonLinux2
    desiredCapacity: 1
    minSize: 1
    maxSize: 10
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - arn:aws:iam::<<aws-account-id>>:policy/<<px-role>>
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
```
- Once you have your EKS cluster deployed, use the steps on the documentation site to configure AWS Karpenter and have it add more nodes to your EKS cluster when you need more CPU and memory resources for your application.
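That configuration step centers on a Karpenter Provisioner object. As a rough sketch of what one might look like for this cluster, here is a minimal Provisioner using the karpenter.sh/v1alpha5 API generation that was current for EKS 1.21. The subnet and security group selectors reuse the karpenter.sh/discovery: kar-bshah tag from the eksctl config above; the capacity type, CPU limit, and TTL values are assumptions you should adapt to your environment:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # restrict Karpenter to on-demand capacity (assumption; spot also works)
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: 100          # cap total provisioned CPU (example value)
  provider:
    subnetSelector:
      karpenter.sh/discovery: kar-bshah
    securityGroupSelector:
      karpenter.sh/discovery: kar-bshah
  ttlSecondsAfterEmpty: 30  # scale empty nodes back down after 30s
```

With a Provisioner like this in place, any pod that the scheduler leaves Pending for lack of capacity prompts Karpenter to launch a right-sized node, complementing the storage-side automation Autopilot provides.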
If you want to see Portworx Autopilot and AWS Karpenter in action, watch the following video, where we demonstrate how you can scale your compute and storage capacity as and when you need it.