How to install Kubeflow on Amazon EKS with Portworx

Machine learning (ML) has grown in popularity over the past five years and has become a crucial part of almost every industry, including healthcare, finance, entertainment, retail, and manufacturing. Machine learning is a branch of artificial intelligence that focuses on building applications that learn and improve from experience without being explicitly programmed to make specific adjustments. ML models are built and deployed in three steps: data preparation, model training, and deployment and inference. Although the process sounds simple, building ML models that are accurate and serve the business goals requires several different personas inside an organization to work together.

Kubeflow has become one of the quintessential tools for organizations that want to build ML models and use Kubernetes as the underlying infrastructure. The Kubeflow project is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow doesn’t recreate services but provides a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Kubeflow deployment includes the following:

Jupyter Notebooks: Create and manage interactive Jupyter Notebooks for your ML workflows.
TensorFlow Model Training: Kubeflow’s job operator can handle distributed TensorFlow training jobs.
Model Serving: Kubeflow supports a TensorFlow Serving container to export trained TensorFlow models to Kubernetes.
Pipelines: Kubeflow Pipelines is a comprehensive solution for deploying and managing end-to-end ML workflows. Use Kubeflow Pipelines for rapid and reliable experimentation.

In this blog series, we will talk about how data scientists can build, train, and deploy their ML models using Kubeflow 1.3 on Amazon EKS with Portworx and how DevOps engineers or infrastructure operators can provide the underlying enterprise-grade storage services that facilitate use cases like:

Leverage enterprise-grade Kubernetes storage for your Jupyter Notebooks, with features like file and block storage, high availability, and encryption.
Protect your ML models by taking snapshots of your Kubernetes persistent volumes.
Transfer your trained ML models to another Kubernetes cluster for deployment and inference.
Protect your entire Kubeflow deployment by using PX-Backup for Kubernetes backup and restore.

Let’s start by creating an AWS EKS cluster using m5.xlarge instances for storage and p3.2xlarge instances for running the data preparation and training phases for our ML model. We will also attach the additional IAM policies needed to run Portworx on the EKS cluster once it is deployed. We will deploy the cluster with eksctl, using the following cluster config:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-px-bshah-mixed
  region: us-east-1
  version: "1.20"
managedNodeGroups:
  - name: storage-nodes
    instanceType: m5.xlarge
    minSize: 3
    maxSize: 3
    volumeSize: 100
    #ami: auto
    amiFamily: AmazonLinux2
    labels: {role: worker, "portworx.io/node-type": "storage"}
    tags:
      nodegroup-role: worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - <arn-of-your-portworx-aws-iam-policy>
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
  - name: storageless-nodes
    instanceType: p3.2xlarge
    minSize: 3
    maxSize: 6
    desiredCapacity: 3
    volumeSize: 50
    amiFamily: AmazonLinux2
    labels: {role: worker}
    tags:
      nodegroup-role: worker-storageless
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - <arn-of-your-portworx-aws-iam-policy>
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
availabilityZones: [ 'us-east-1a', 'us-east-1b', 'us-east-1d' ]

eksctl create cluster -f demo-px-bshah-gpu.yaml
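
Before installing Portworx, it can help to confirm that the storage node group came up with the label set in the cluster config. The snippet below is a minimal sketch of one way to do that. The helper name `count_ready_storage_nodes` is hypothetical, and it assumes `kubectl` already points at the new cluster (eksctl normally updates your kubeconfig).

```shell
# Hypothetical helper: counts nodes that carry the storage label from the
# cluster config above and whose STATUS column is exactly "Ready".
count_ready_storage_nodes() {
  kubectl get nodes -l portworx.io/node-type=storage --no-headers \
    | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage (run once eksctl finishes):
#   count_ready_storage_nodes
```

With the config above, the storage node group is fixed at three nodes, so a lower count suggests that some nodes have not joined yet.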

Once we have the EKS cluster up and running, we will go ahead and deploy Portworx on our EKS cluster. We won’t cover the Portworx deployment in this blog; you can follow our documentation to install Portworx. Portworx will auto-provision EBS volumes, attach them to your EKS worker nodes, and aggregate those into a single Portworx storage pool that can be used by our Jupyter Notebooks.

Next, let’s create object storage credentials on the Portworx cluster. We will reference the ID of these credentials in our storage class definitions, so that any persistent volume dynamically provisioned from those storage classes stores its cloudsnaps in the object store bucket:

/opt/pwx/bin/pxctl credentials create --provider s3 --s3-access-key <<access-key-id>> --s3-secret-key <<secret-access-key>> --s3-region us-east-1 --s3-endpoint s3.amazonaws.com kb-s3-px-creds

/opt/pwx/bin/pxctl credentials validate kb-s3-px-creds
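
The pxctl commands above need to run where Portworx is installed. If you would rather not SSH into a worker node, one option is to run them through `kubectl exec` in a Portworx pod. This is a sketch under assumptions: it assumes Portworx runs in the kube-system namespace with pods labeled name=portworx, which may differ in your installation, and `px_exec` is a hypothetical helper name.

```shell
# Assumed values; adjust them to match your Portworx installation.
PX_NAMESPACE="${PX_NAMESPACE:-kube-system}"
PX_POD_LABEL="${PX_POD_LABEL:-name=portworx}"

# Hypothetical helper: runs pxctl with the given arguments inside the first
# Portworx pod that matches the label.
px_exec() {
  px_pod=$(kubectl get pods -n "$PX_NAMESPACE" -l "$PX_POD_LABEL" \
    -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$px_pod" ] || { echo "no Portworx pod found" >&2; return 1; }
  kubectl exec -n "$PX_NAMESPACE" "$px_pod" -- /opt/pwx/bin/pxctl "$@"
}

# Usage:
#   px_exec credentials validate kb-s3-px-creds
```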

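Next, let’s create block and file Portworx storage classes using the YAML files below and change the default storage class from gp2 to the Portworx standard-sc storage class. We will also create two schedule policies and reference them from the storage classes to protect our ML pipelines: weekly-and-interval, which triggers every 15 minutes and every Friday and is used for cloudsnaps, and daily, which triggers every 30 minutes and once a day and is used for local snapshots.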

#cat weeklyandinterval.yaml 
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: weekly-and-interval
policy:
  interval:
    intervalMinutes: 15
  weekly:
    day: "Friday"
    time: "10:00PM"

kubectl apply -f weeklyandinterval.yaml 
schedulepolicy.stork.libopenstorage.org/weekly-and-interval configured

#cat daily.yaml
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: daily
policy:
  interval:
    intervalMinutes: 30
  daily:
    time: "10:14PM"
    retain: 3

kubectl apply -f daily.yaml 
schedulepolicy.stork.libopenstorage.org/daily configured

kubectl get schedulepolicy
NAME                       AGE
daily                      4m55s
default-daily-policy       48m
default-interval-policy    48m
default-migration-policy   48m
default-monthly-policy     48m
default-weekly-policy      48m
weekly-and-interval        4m31s

######################
cat standard-sc.yaml 
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  io_profile: "auto"
  snapshotschedule.stork.libopenstorage.org/default-schedule: |
    schedulePolicyName: daily
    annotations:
      portworx/snapshot-type: local
  snapshotschedule.stork.libopenstorage.org/weekly-schedule: |
    schedulePolicyName: weekly-and-interval
    annotations:
      portworx/snapshot-type: cloud
      portworx/cloud-cred-id: <<object-store-credentials-id>>

kubectl apply -f standard-sc.yaml 
storageclass.storage.k8s.io/standard-sc created

######################

cat shared-sc.yaml 
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: shared-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  allow_all_ips: "true"
  sharedv4: "true"
  snapshotschedule.stork.libopenstorage.org/default-schedule: |
    schedulePolicyName: daily
    annotations:
      portworx/snapshot-type: local
  snapshotschedule.stork.libopenstorage.org/weekly-schedule: |
    schedulePolicyName: weekly-and-interval
    annotations:
      portworx/snapshot-type: cloud
      portworx/cloud-cred-id: <<object-store-credentials-id>>

kubectl apply -f shared-sc.yaml 
storageclass.storage.k8s.io/shared-sc created
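
To show how these classes would be consumed, here is a sketch of a claim against shared-sc. Because that class sets sharedv4, its volumes can be mounted by several pods at once, so the claim requests ReadWriteMany. The helper name, the claim name, and the size are hypothetical and are not part of the original setup.

```shell
# Hypothetical helper: prints a PersistentVolumeClaim manifest for shared-sc.
# $1 = claim name, $2 = requested size (for example 10Gi).
shared_pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: shared-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: $2
EOF
}

# Usage:
#   shared_pvc_manifest shared-data 10Gi | kubectl apply -f -
```

Printing the manifest to stdout keeps the helper free of side effects, so you can inspect the YAML before piping it to kubectl.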

######################
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
storageclass.storage.k8s.io/gp2 patched

kubectl patch storageclass standard-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/standard-sc patched

kubectl get sc
NAME                    PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
shared-sc               kubernetes.io/portworx-volume   Delete          Immediate           false                  47s
standard-sc (default)   kubernetes.io/portworx-volume   Delete          Immediate           false                  53s
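
For claims without an explicit storageClassName to land on Portworx, exactly one class should be marked as default. The following is a minimal sketch of a check for that. The helper name is hypothetical, and it relies on the "(default)" marker that `kubectl get sc` prints next to the class name.

```shell
# Hypothetical helper: prints the names of all storage classes marked as
# default, one per line, based on the "(default)" marker in the output
# of `kubectl get sc`.
default_storage_classes() {
  kubectl get sc --no-headers | awk '$2 == "(default)" { print $1 }'
}

# Usage:
#   default_storage_classes
```

If this prints more than one name, or a name other than standard-sc, revisit the two patch commands above.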

Next, let’s install Kubeflow 1.3 on our EKS cluster running Portworx. The installation process for Kubeflow has been greatly simplified in this release; the following steps get Kubeflow up and running. For this blog and demo, I am using port-forward to access the Kubeflow UI, but you can also configure HTTPS access to the istio-ingressgateway to securely access the Kubeflow UI through the load balancer IP address.

git clone https://github.com/kubeflow/manifests.git
cd manifests

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
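
The apply loop returns as soon as the resources are accepted, which can be well before the pods are ready, so the port-forward may not work right away. The sketch below shows one way to see how many pods are still starting. The helper name `pending_pods` is hypothetical, and it simply parses the default `kubectl get pods` columns.

```shell
# Hypothetical helper: counts pods in a namespace that are not Completed and
# are either not Running or do not have all containers ready.
# $1 = namespace (defaults to kubeflow).
pending_pods() {
  kubectl get pods -n "${1:-kubeflow}" --no-headers \
    | awk '{
        split($2, ready, "/")
        if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2])) n++
      } END { print n + 0 }'
}

# Usage: poll until nothing is pending in the kubeflow namespace.
#   while [ "$(pending_pods kubeflow)" -gt 0 ]; do sleep 10; done
```

The installation also creates pods in other namespaces, such as istio-system, so you may want to run the same check there before opening the UI.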

Once Kubeflow is up and running, you can look at the persistent volumes dynamically provisioned from Portworx in the kubeflow namespace (for example, with kubectl get pvc -n kubeflow). Kubeflow Pipelines produces and stores both metadata and artifacts from each run. The metadata from the pipeline runs is stored in a MySQL database, and the artifacts are stored in a MinIO server. You will also see a katib-mysql persistent volume, which is used to store the artifacts from Katib, the hyperparameter tuning module of Kubeflow. These volumes might not be important to data scientists, but they still need to be replicated and protected using Portworx to ensure continuous availability for Kubeflow.

NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
katib-mysql      Bound    pvc-9d4b6c6e-404f-4ee3-81b0-4e001a0091d0   10Gi       RWO            standard-sc    12m
minio-pvc        Bound    pvc-d34e3cfa-d537-49ae-8453-5d86c94ea392   20Gi       RWO            standard-sc    12m
mysql-pv-claim   Bound    pvc-af91f6e3-ac4d-42d6-9b0c-5228d4a17f56   20Gi       RWO            standard-sc    12m

In addition to the Kubeflow Kubernetes cluster, we will also deploy another EKS cluster for PX-Backup. We will use the PX-Backup instance to enable end-to-end Kubernetes backup and restore for our ML pipelines running on Kubeflow. To deploy PX-Backup on an EKS cluster, follow the instructions on our documentation site.
Once you have PX-Backup up and running, we will go ahead and create a cloud account using our AWS Access Key ID and Secret Access Key, add an Object Storage bucket and add our Kubeflow cluster to the PX-Backup UI.

That’s all the setup you need to do as part of this blog. In the next blog, we will talk about how you can use the Kubeflow UI to create your Jupyter Notebooks for data preparation, training, and inference. We will also discuss how to use Portworx snapshots and PX-Backup to restore your trained models to the same cluster and transfer your trained models to a new EKS cluster respectively.