
Contents

  • Why Portworx for LLMs on Kubernetes
  • Step-by-step guide to deploying DeepSeek with Portworx

In this article, we’ll explain why LLMs like DeepSeek R1 are often deployed and scaled on Kubernetes, and why Portworx is a strong choice for Kubernetes storage when deploying LLMs. We’ll also walk through a step-by-step tutorial for deploying DeepSeek R1 on your Kubernetes cluster, using vLLM as the inference engine and Portworx for storage management.

Why deploy LLMs on Kubernetes

According to the 2024 Voice of Kubernetes Experts report, 54% of organizations running Kubernetes in production use it to support AI/ML workloads. Inference is faster when model weights are cached locally, and Portworx, as the leading container data management and Kubernetes storage solution, is a natural fit for accelerating these workloads.

Why run LLMs using Portworx Volumes

Using Portworx volumes for model caching offers several advantages, particularly when deploying applications like vLLM on Kubernetes. Here are the key benefits:

1. High Availability and Reliability

  • Portworx ensures data redundancy and replication across nodes in the Kubernetes cluster, reducing the risk of data loss due to node or pod failures.
  • Automatic failover mechanisms ensure uninterrupted access to the cache even during node failures.

2. Performance Optimization

  • Low Latency: Portworx volumes provide high IOPS and low latency, which is crucial for caching LLMs to optimize inference speed.
  • Locality Awareness: It intelligently manages data locality, reducing access time by serving cache data from the closest storage node.

3. Dynamic Scaling

  • Portworx supports dynamic provisioning, enabling the cache volume to scale up or down based on model size or traffic demands.
  • Elastic scalability ensures that your application performs well even under varying workloads.

4. Data Persistence

  • Cached model data is persisted across pod restarts or rescheduling, preventing the need to reload models from scratch.
  • Persistent volumes ensure faster recovery times and reduced initialization overhead after failures.

5. Multi-Model Support

  • Portworx volumes can host multiple cached models simultaneously, making it suitable for multi-tenant or multi-model deployments where different models need to be accessed concurrently.

6. Kubernetes-Native Integration

  • Portworx is designed to integrate seamlessly with Kubernetes, supporting features like Persistent Volumes (PVs), Persistent Volume Claims (PVCs), and Storage Classes.
  • It simplifies storage management through declarative YAML configurations and Kubernetes-native tools.

7. Cost Efficiency

  • With efficient caching on Portworx, frequently accessed data like model weights can remain in the cache, reducing expensive cloud storage or retrieval costs from S3 or other remote storage solutions.
  • Fine-grained control over replication factors allows balancing between performance and storage costs.

8. Snapshot and Backup Capabilities

  • Portworx enables taking snapshots of cached data, making it easy to restore or replicate models across environments (e.g., staging and production).
  • Backup and disaster recovery capabilities ensure that even cached data can be protected and restored.

9. Support for Hybrid and Multi-Cloud Deployments

  • Portworx supports hybrid and multi-cloud environments, making it ideal for deployments where models and workloads span on-premises and cloud infrastructure.
  • Consistent storage across environments simplifies deployment and management.

10. Advanced Security

  • Portworx provides features like encryption at rest and in transit, ensuring that sensitive data, including cached model weights, remains secure.
  • Role-based access control (RBAC) and integration with Kubernetes security mechanisms enhance protection.

11. Ease of Use

  • With Portworx, administrators can dynamically manage storage with Kubernetes-native tools, avoiding the need for manual storage allocation or adjustments.
  • Self-healing capabilities ensure minimal manual intervention in maintaining storage health.

12. Reduced Latency in Multi-Node Clusters

  • For distributed workloads, Portworx enables caching closer to compute nodes, reducing latency in model inference pipelines.
  • This is especially beneficial in environments where models are frequently queried, ensuring consistent and fast performance.

By using Portworx Volumes for caching, you can ensure a robust, high-performing, and cost-efficient storage solution that complements the speed and scalability needs of deploying large language models like those served with vLLM.

Step-by-Step Guide

Here’s a step-by-step guide to deploying a DeepSeek R1 Hugging Face model using vLLM on a Kubernetes cluster with Portworx Volumes for caching:

Step 1: Prerequisites

  • Kubernetes Cluster: Ensure you have a running Kubernetes cluster with GPUs; these can be provisioned as managed services from Google GKE, Azure AKS, or Amazon EKS.
  • kubectl: Install and configure kubectl to interact with your cluster.
  • Portworx: Install and configure Portworx as the storage solution in your Kubernetes cluster. If you’re not already using Portworx, you can get started with a free trial.
  • Docker Image for vLLM: Create or use an available vLLM Docker image with the Hugging Face model and dependencies installed.

Step 2: Install Portworx in Kubernetes

  1. Follow the official Portworx documentation to install Portworx on your Kubernetes cluster. Ensure the cluster supports Persistent Volumes.
  2. Verify Portworx installation by running:
kubectl get storagecluster -A

Step 3: Create a Portworx Storage Class
Define a storage class that vLLM will use for caching. Create a YAML file (e.g., portworx-storage-class.yaml):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3" # Number of replicas

Apply the storage class:

kubectl apply -f portworx-storage-class.yaml

Step 4: Create a Persistent Volume Claim (PVC)
Create a PVC that uses the Portworx storage class for caching. The manifests in this guide use the deepseek namespace; create it first if it doesn’t exist:

kubectl create namespace deepseek

Then save the PVC as portworx-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
  namespace: deepseek
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: portworx-sc

Apply the PVC:

kubectl apply -f portworx-pvc.yaml

Step 5: Create a Deployment for the vLLM-hosted DeepSeek R1 Model
Create a Kubernetes Deployment that runs the DeepSeek R1 model under vLLM.

Copy the manifest below and replace <hugging-face-token> with your Hugging Face API token.

Save this as deepseekr1-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: deepseek
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
    spec:
      containers:
      - name: deepseek
        image: vllm/vllm-openai:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<hugging-face-token>"
        args: [
          "--model", "deepseek-ai/DeepSeek-R1",
          "--port", "8000",
          "--trust-remote-code"
        ]
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: cache-pvc

Apply the deployment:

kubectl apply -f deepseekr1-deployment.yaml

Step 6: Expose the vLLM Service
Create a Service to expose the vLLM deployment. Save it as deepseekr1-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-service
  namespace: deepseek
spec:
  selector:
    app: deepseek
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer  # Change to ClusterIP or NodePort if needed

Apply the service:

kubectl apply -f deepseekr1-service.yaml

Step 7: Verify the Deployment

1. Check the pods:

kubectl get pods -n deepseek

2. Verify the PVC is bound:

kubectl get pvc -n deepseek

3. Access the service:

  • Use the external IP of the LoadBalancer service to access the vLLM endpoint.
  • For example:
curl -X POST "http://<external-ip>/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
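The same request can be issued from Python using only the standard library. This is a hedged sketch: build_chat_request and chat are illustrative helper names (not part of vLLM), and EXTERNAL_IP is a placeholder for your service address.

```python
# Minimal client for the OpenAI-compatible endpoint that vLLM exposes.
# "http://EXTERNAL_IP" below is a placeholder for your LoadBalancer address.
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a chat completion request and return the first choice's text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example usage (requires a reachable service):
# answer = chat("http://EXTERNAL_IP", "deepseek-ai/DeepSeek-R1",
#               "What is the capital of France?")
```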

Step 8: Monitor and Scale
1. Logs: Monitor logs to ensure the deployment works:

kubectl logs -f deployment/deepseek-r1 -n deepseek

2. Scaling:
Update the replicas field in the deployment YAML to scale horizontally.
3. Performance Tuning:

  • Adjust resource requests/limits for CPU and memory in the deployment spec.
  • Use GPU nodes if required for faster inference.
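As a sketch of that tuning, the fragment below could be added under the deepseek container in the Step 5 deployment. The CPU, memory, and GPU values here are illustrative assumptions, not recommendations; size them to your model and nodes. Scheduling onto GPU nodes also requires the NVIDIA device plugin to be installed on the cluster.

```yaml
# Illustrative values only -- merge into the "deepseek" container spec in Step 5.
resources:
  requests:
    cpu: "8"
    memory: 64Gi
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"
```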

Use Pure Storage FlashArray as a Direct Access Volume for Cache

To leverage the performance of Portworx FlashArray Direct Access volumes for caching large models, change the storage class configuration from Step 3 as follows:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: sc-portworx-fa-direct-access
provisioner: pxd.portworx.com
parameters:
  backend: "pure_block"
  max_iops: "1000"
  max_bandwidth: "1G"
allowVolumeExpansion: true

This setup ensures your model uses vLLM for efficient serving and Portworx for caching, optimizing performance and storage reliability.

Get Started with Portworx

When it comes to AI/ML workflows like model inference, fast access to data is essential. Model caching with a leading Kubernetes storage and data management solution like Portworx is critical to accelerating these workflows. To learn more about Portworx, get started with a free trial, or reach out to us for a dedicated conversation on your AI/ML challenges today.


Girish Sadhani

Member of Technical Staff, Portworx by Pure Storage