How to Deploy HA JupyterHub on Amazon Elastic Kubernetes Service

The Jupyter Notebook is an open-source web application that allows data scientists to create and share documents that contain live code, equations, visualizations, comments, and narrative text. It’s a powerful integrated development environment for data exploration, data processing, data analysis, machine learning, and analytics. 

 

JupyterHub is a multi-user platform that brings the power of Jupyter Notebook to enterprises. It’s a scalable, portable, customizable, and flexible environment for running Jupyter Notebooks in a multi-tenant model. JupyterHub can be deployed on bare metal servers, virtual machines, public cloud infrastructure, containers, and container orchestration engines, and it can be deployed and managed on Kubernetes through Helm charts.

 

JupyterHub is a stateful workload that depends on a reliable persistence layer. When it is deployed in Kubernetes, JupyterHub needs a cloud native, scale-out data management layer.

 

Portworx is a cloud native storage platform to run persistent workloads deployed on a variety of orchestration engines, including Kubernetes. With Portworx, customers can manage the database of their choice on any infrastructure using any container scheduler. It provides a single data management layer for all stateful services, no matter where they run. 

 

JupyterHub Architecture

The JupyterHub platform has three essential components—hub, proxy, and single-user Notebook server. The hub is the heart of the platform that orchestrates the lifecycle of a Notebook. The proxy acts as the front-end to route requests to the hub, which is exposed to the outside world through an HTTP load balancer or in Kubernetes, an ingress controller. When a user logs into the platform, the hub provisions a single-user Notebook instance for them. Each user gets a dedicated instance of the Notebook that is completely isolated from the other users. In Kubernetes, the instance is mapped to a pod.

 
After a specific period of inactivity, the hub automatically culls the pod associated with the inactive user. When the same user logs in again, the hub schedules a pod that contains the state persisted during the previous session. 

 

Behind the scenes, JupyterHub creates a persistent volume claim (PVC) and a persistent volume for each user. Even though the pod gets deleted as part of the culling process, the PV is retained, which gets attached to the new pod when an existing user logs in. 

 

The hub maintains the state of users, groups, permissions, and other settings in an SQLite database, which is stored on the disk. There is a PVC associated with the storage volume used for persisting the database file. 

 

Apart from the dedicated storage required by the common database and each user, JupyterHub also supports shared storage volumes that are available to all the users. This shared storage is used to populate common datasets, files, and other objects that will be available to all users of the system. 

 

Like any stateful workload, the availability of JupyterHub is dependent on the reliability and availability of the storage engine backing the application. The availability of three volumes—database storage, per-user storage, and shared storage—are critical to the uptime of JupyterHub. 

 

When Portworx is used as the storage orchestration engine for JupyterHub, it increases the overall reliability, availability, and mobility of the platform. Some of the key features of Portworx—such as customizable storage profiles, automatic replication, dynamic volume expansion, automated snapshots, and migration to other clusters—make it the ideal choice to run JupyterHub on Kubernetes. 

 

This tutorial is a walk-through of the steps involved in deploying and managing a highly available JupyterHub environment on Kubernetes. We will configure Portworx as the storage engine for all the stateful components of JupyterHub.

 

In summary, to run HA JupyterHub on Amazon Elastic Kubernetes Service (EKS), you need to: 

 

  1. Set up and configure a Kubernetes cluster in Amazon EKS
  2. Install a cloud native storage solution like Portworx on Kubernetes
  3. Create storage classes for the database, users, and shared storage layers of JupyterHub
  4. Deploy JupyterHub on Kubernetes through a customized Helm chart
  5. Test failover by killing or cordoning a node in the cluster
  6. Expand the storage volume without downtime

 

How to set up an EKS cluster

Portworx is fully supported on Amazon EKS. Please follow the instructions in the Amazon EKS documentation to configure a cluster. 

 

You should have a three-node Kubernetes cluster deployed based on the default EKS configuration. 

 

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-3-23.us-west-2.compute.internal     Ready    <none>   60m   v1.13.7-eks-c57ff8
ip-192-168-73-47.us-west-2.compute.internal    Ready    <none>   60m   v1.13.7-eks-c57ff8
ip-192-168-80-126.us-west-2.compute.internal   Ready    <none>   60m   v1.13.7-eks-c57ff8

Once the cluster is up and running, install Helm. For detailed instructions on installing and configuring Helm, refer to its documentation.

 

If you get errors while installing Tiller on EKS, run the following commands:

 

$ kubectl create serviceaccount --namespace kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ helm init --service-account tiller --upgrade
$ kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'      
$ kubectl patch deployment tiller-deploy --namespace=kube-system --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'

 

Ensure that Helm is up and running before proceeding further.

 

$ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}

 

Installing Portworx on EKS

Installing Portworx on Amazon EKS is not very different from installing it on a Kubernetes cluster set up through Kops. The Portworx EKS documentation covers the steps involved in running a Portworx cluster in a Kubernetes environment deployed on AWS.

 

The Portworx cluster needs to be up and running on EKS before proceeding to the next step. The kube-system namespace should have the Portworx pods in a running state.

 

$ kubectl get pods -n=kube-system -l name=portworx
NAME             READY     STATUS    RESTARTS   AGE
portworx-blqjh   1/1       Running   0          8d
portworx-c8bf2   1/1       Running   0          8d
portworx-z2j6z   1/1       Running   0          8d

Creating Storage Classes for JupyterHub

Through storage class objects, an admin can define different classes of Portworx volumes that are offered in a cluster. These classes will be used during the dynamic provisioning of volumes. The storage class defines the replication factor, I/O profile (e.g., for a database or a CMS), and priority (e.g., SSD or HDD). These parameters impact the availability and throughput of workloads and can be specified for each volume. This matters because, for example, a production database has different requirements than a development Jenkins cluster.

 

JupyterHub needs two storage classes with distinct capabilities. The first storage class is meant for the database and user home directories. This needs to be replicated across multiple nodes to ensure high availability. The second type of storage is a shared volume that is available in read/write mode to all the users. 

 

Let’s create the storage class with a replication factor of 3, which ensures data redundancy for the database and user home directories.

 

$ cat > px-jhub-sc.yaml << EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
    name: px-jhub-sc
provisioner: kubernetes.io/portworx-volume
parameters:
   repl: "3"
EOF
$ kubectl create -f px-jhub-sc.yaml
storageclass.storage.k8s.io "px-jhub-sc" created

 

Next, we will create the storage class for the shared volume. Note that it uses the CMS I/O profile to optimize access and throughput. The replication factor is set to 1, so this volume is not replicated across nodes.

 

$ cat > px-jhub-shared-sc.yaml << EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: px-jhub-shared-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "1"
  shared: "true"
  io_profile: "cms"
EOF
$ kubectl create -f px-jhub-shared-sc.yaml
storageclass.storage.k8s.io "px-jhub-shared-sc" created

 

We also need to create a PVC and PV based on the shared storage class. The PVC is passed onto the JupyterHub configuration to mount the shared volume.

 
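The contents of px-jhub-shared-pvc.yaml are not reproduced here; a minimal version, assuming a 1Gi request and the ReadWriteMany access mode implied by the shared volume, would look like this:

```shell
# Minimal PVC for the shared volume (assumed sizing: 1Gi, matching the
# capacity shown in the kubectl output that follows)
$ cat > px-jhub-shared-pvc.yaml << EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: px-jhub-shared-vol
spec:
  storageClassName: px-jhub-shared-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
```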

$ kubectl create -f px-jhub-shared-pvc.yaml
persistentvolumeclaim "px-jhub-shared-vol" created

 

The PVC and PV are created and ready to use.

 

$ kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
px-jhub-shared-vol   Bound    pvc-7825d239-d2ec-11e9-be9b-0ac6c62f52b2   1Gi        RWX            px-jhub-shared-sc   15s

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS        REASON   AGE
pvc-7825d239-d2ec-11e9-be9b-0ac6c62f52b2   1Gi        RWX            Delete           Bound    default/px-jhub-shared-vol   px-jhub-shared-sc            53s

 

Installing JupyterHub

JupyterHub is available as a Helm chart, which needs to be customized to work with Portworx. This is done by passing additional configuration to the chart through the config.yaml file.

 

First, we need to generate a random hex string representing 32 bytes to use as a security token.

 

$ openssl rand -hex 32
45a74f657dc4fcdc0b1a1cf2edbb36b6d5d39a72e7327186ca87db811ac764e6

 

Create a file called config.yaml and add the generated token.

 

proxy:
  secretToken: "45a74f657dc4fcdc0b1a1cf2edbb36b6d5d39a72e7327186ca87db811ac764e6"

 

Next, we need to customize the user environment by passing the appropriate storage configuration. Add the following to config.yaml.

 

singleuser:
  storage:
    dynamic:
      storageClass: px-jhub-sc
    extraVolumes:
      - name: jhub-shared
        persistentVolumeClaim:
          claimName: px-jhub-shared-vol
    extraVolumeMounts:
      - name: jhub-shared
        mountPath: /home/shared  

 

Notice that the home directory takes advantage of dynamic provisioning, while the shared volume is based on the PVC created in the previous step. Each time a new user logs in, a PVC and PV are dynamically created for them based on the specified storage class. The shared PVC, px-jhub-shared-vol, is attached to each pod and is accessible as the /home/shared directory.
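A quick way to verify the shared mount is to open a terminal from one user's Notebook server and write a file under /home/shared; assuming the volume is mounted as configured above, every other user sees the same file:

```shell
# In a terminal inside user1's Notebook pod (hypothetical session)
$ echo "common dataset notes" > /home/shared/README.txt

# In a terminal inside admin's Notebook pod, the same file is visible
$ cat /home/shared/README.txt
common dataset notes
```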

 

The SQLite database is persisted on a volume dynamically provisioned by Portworx. We will pass this configuration in the same config.yaml file.

 

hub:
  db:
    type: sqlite-pvc
    pvc:
      storageClassName: px-jhub-sc

 

Finally, we will add user admin to the administrator group. An admin in JupyterHub has access to other users’ Notebooks.

 

auth:
  admin:
    users:
      - admin

 

Below is the complete config.yaml file with all the settings:

 

proxy:
  secretToken: "45a74f657dc4fcdc0b1a1cf2edbb36b6d5d39a72e7327186ca87db811ac764e6"
singleuser:
  storage:
    dynamic:
      storageClass: px-jhub-sc
    extraVolumes:
      - name: jhub-shared
        persistentVolumeClaim:
          claimName: px-jhub-shared-vol
    extraVolumeMounts:
      - name: jhub-shared
        mountPath: /home/shared  
hub:
  db:
    type: sqlite-pvc
    pvc:
      storageClassName: px-jhub-sc
auth:
  admin:
    users:
      - admin

 

We are now ready to deploy the JupyterHub Helm chart. Add the JupyterHub chart repository to Helm and refresh it.

 

$ helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "stable" chart repository
...Successfully got an update from the "jupyterhub" chart repository
Update Complete. ⎈ Happy Helming!⎈

 

Let’s deploy the chart, passing the config.yaml with our custom configuration.

 

$ helm upgrade --install jhub jupyterhub/jupyterhub \
  --version=0.8.2 \
  --values config.yaml

 

This results in the creation of two pods—hub and proxy—along with a service that exposes the proxy pod through a load balancer.

Let’s explore the current state of the storage classes, PVCs, and PVs.
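One way to inspect these objects, keeping in mind that exact volume names and ages will differ in your cluster, is:

```shell
$ kubectl get sc px-jhub-sc px-jhub-shared-sc
$ kubectl get pvc
$ kubectl get pv
```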

As soon as a user signs in, a new pod and a PVC are created for that user. Let’s log in to the hub with username admin and password admin, accessing JupyterHub through the load balancer’s IP address. 
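To find the load balancer address, query the public proxy service created by the Helm chart. In chart version 0.8.x this service is named proxy-public (an assumption based on the chart's defaults); on EKS, the EXTERNAL-IP column shows an ELB hostname:

```shell
$ kubectl get svc proxy-public
```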

 

Immediately after the user logs in, JupyterHub spawns a new pod.

 
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
hub-7874f475b8-5gpx4     1/1     Running   0          10m
jupyter-admin            1/1     Running   0          18s
proxy-78996bfc89-kwhtm   1/1     Running   0          10m

 

The pod also gets a new PVC that follows the naming convention of claim-<username>.

 

$ kubectl get pvc | awk {'print $1" " $2" "$3" "$6'} | column -t
NAME                STATUS  VOLUME                                    MODES
claim-admin         Bound   pvc-7ba77bf3-d2ee-11e9-be9b-0ac6c62f52b2  px-jhub-sc
hub-db-dir          Bound   pvc-10201070-d2ed-11e9-be9b-0ac6c62f52b2  px-jhub-sc
px-jhub-shared-vol  Bound   pvc-7825d239-d2ec-11e9-be9b-0ac6c62f52b2  px-jhub-shared-sc

 

Note: This installation of JupyterHub is not integrated with an authentication system; any arbitrary username and password can be used to log in. For securing production deployments, refer to the JupyterHub guide on integrating authentication and authorization. 
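As an illustration only, the 0.8.x chart lets you plug in an external authenticator through the auth section of config.yaml. A sketch for GitHub OAuth, where the client ID, client secret, and callback URL are placeholders you must supply from your own OAuth application, might look like this:

```yaml
auth:
  type: github
  github:
    clientId: "<your-oauth-client-id>"          # placeholder
    clientSecret: "<your-oauth-client-secret>"  # placeholder
    callbackUrl: "http://<load-balancer-address>/hub/oauth_callback"
```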

 

We will now log in as a different user, user1, to create a new profile.

 

 

When logged in as admin, the control panel shows all the registered users in the system.

 

 

In the next part of this tutorial, we will explore failover and performing storage operations on the volumes. 

 

 

 

Janakiram MSV

Contributor | Certified Kubernetes Administrator (CKA) and Developer (CKAD)
