Contents
- Why Portworx for LLMs on Kubernetes
- Step-by-step guide to deploying DeepSeek with Portworx
In this article, we’ll explain why LLMs like DeepSeek R1 are often deployed and scaled on Kubernetes, and why Portworx is a strong choice for the Kubernetes storage layer beneath them. We’ll also walk through a step-by-step tutorial for deploying DeepSeek R1 on your Kubernetes cluster, using vLLM as the inference engine and Portworx for storage management.
Why deploy LLMs on Kubernetes
According to the 2024 Voice of Kubernetes Experts report, 54% of organizations running Kubernetes in production use it to support AI/ML workloads. Inference starts faster when model weights are cached rather than re-downloaded on every restart, and Portworx, as the leading container data management and Kubernetes storage solution, is a natural choice for accelerating these workloads.
Why run LLMs using Portworx Volumes
Using Portworx volumes for model caching offers several advantages, particularly when deploying applications like vLLM on Kubernetes. Here are the key benefits:
1. High Availability and Reliability
- Portworx ensures data redundancy and replication across nodes in the Kubernetes cluster, reducing the risk of data loss due to node or pod failures.
- Automatic failover mechanisms ensure uninterrupted access to the cache even during node failures.
2. Performance Optimization
- Low Latency: Portworx volumes provide high IOPS and low latency, which is crucial for caching LLMs to optimize inference speed.
- Locality Awareness: It intelligently manages data locality, reducing access time by serving cache data from the closest storage node.
3. Dynamic Scaling
- Portworx supports dynamic provisioning, enabling the cache volume to scale up or down based on model size or traffic demands.
- Elastic scalability ensures that your application performs well even under varying workloads.
4. Data Persistence
- Cached model data is persisted across pod restarts or rescheduling, preventing the need to reload models from scratch.
- Persistent volumes ensure faster recovery times and reduced initialization overhead after failures.
5. Multi-Model Support
- Portworx volumes can host multiple cached models simultaneously, making it suitable for multi-tenant or multi-model deployments where different models need to be accessed concurrently.
6. Kubernetes-Native Integration
- Portworx is designed to integrate seamlessly with Kubernetes, supporting features like Persistent Volumes (PVs), Persistent Volume Claims (PVCs), and Storage Classes.
- It simplifies storage management through declarative YAML configurations and Kubernetes-native tools.
7. Cost Efficiency
- With efficient caching on Portworx, frequently accessed data like model weights can remain in the cache, reducing expensive cloud storage or retrieval costs from S3 or other remote storage solutions.
- Fine-grained control over replication factors allows balancing between performance and storage costs.
8. Snapshot and Backup Capabilities
- Portworx enables taking snapshots of cached data, making it easy to restore or replicate models across environments (e.g., staging and production); see the example after this list.
- Backup and disaster recovery capabilities ensure that even cached data can be protected and restored.
9. Support for Hybrid and Multi-Cloud Deployments
- Portworx supports hybrid and multi-cloud environments, making it ideal for deployments where models and workloads span on-premises and cloud infrastructures.
- Consistent storage across environments simplifies deployment and management.
10. Advanced Security
- Portworx provides features like encryption at rest and in transit, ensuring that sensitive data, including cached model weights, remains secure.
- Role-based access control (RBAC) and integration with Kubernetes security mechanisms enhance protection.
11. Ease of Use
- With Portworx, administrators can dynamically manage storage with Kubernetes-native tools, avoiding the need for manual storage allocation or adjustments.
- Self-healing capabilities ensure minimal manual intervention in maintaining storage health.
12. Reduced Latency in Multi-Node Clusters
- For distributed workloads, Portworx enables caching closer to compute nodes, reducing latency in model inference pipelines.
- This is especially beneficial in environments where models are frequently queried, ensuring consistent and fast performance.
By using Portworx Volumes for caching, you can ensure a robust, high-performing, and cost-efficient storage solution that complements the speed and scalability needs of deploying large language models like those served with vLLM.
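To make the snapshot capability in point 8 concrete, here is a minimal sketch of a CSI VolumeSnapshot for the model-cache PVC created later in this tutorial. It assumes the Kubernetes snapshot CRDs are installed and that a Portworx VolumeSnapshotClass named px-csi-snapclass exists; the class name is an assumption, so check your cluster with kubectl get volumesnapshotclass:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cache-pvc-snap
  namespace: deepseek
spec:
  volumeSnapshotClassName: px-csi-snapclass # assumed class name; verify on your cluster
  source:
    persistentVolumeClaimName: cache-pvc

The resulting snapshot can then be referenced as a dataSource in a new PVC to clone the cache into another environment.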
Step-by-Step Guide
Here’s a step-by-step guide to deploying a DeepSeek R1 Hugging Face model using vLLM on a Kubernetes cluster with Portworx Volumes for caching:
Step 1: Prerequisites
- Kubernetes Cluster: Ensure you have a running Kubernetes cluster with GPUs; these can be provisioned as managed services from Google GKE, Azure AKS, or Amazon EKS.
- kubectl: Install and configure kubectl to interact with your cluster.
- Portworx: Install and configure Portworx as the storage solution in your Kubernetes cluster. If you’re not already using Portworx, you can get started with a free trial.
- Docker Image for vLLM: Create or use an available vLLM Docker image with the Hugging Face model and dependencies installed.
Step 2: Install Portworx in Kubernetes
- Follow the official Portworx documentation to install Portworx on your Kubernetes cluster. Ensure the cluster supports Persistent Volumes.
- Verify Portworx installation by running:
kubectl get storagecluster -A
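You can also inspect node-level health with the pxctl CLI from inside one of the Portworx pods. A minimal sketch, assuming the Portworx pods carry the label name=portworx and run in the kube-system namespace (adjust both to your install):

PX_POD=$(kubectl get pods -n kube-system -l name=portworx -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system $PX_POD -- /opt/pwx/bin/pxctl status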
Step 3: Create a Portworx Storage Class
Define a storage class that vLLM will use for caching. Create a YAML file (e.g., portworx-storage-class.yaml):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3" # Number of replicas
Apply the storage class:
kubectl apply -f portworx-storage-class.yaml
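Optionally, confirm the class is registered before moving on:

kubectl get storageclass portworx-sc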
Step 4: Create a Persistent Volume Claim (PVC)
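The manifests in this and the remaining steps live in a deepseek namespace, which the guide assumes exists. Create it once up front:

kubectl create namespace deepseek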
Create a PVC that uses the Portworx storage class for caching. Save it as portworx-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
  namespace: deepseek
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: portworx-sc

Size the storage request to the model you deploy: 50Gi is enough for smaller distilled DeepSeek R1 variants, but the full DeepSeek-R1 checkpoint is substantially larger.
Apply the PVC:
kubectl apply -f portworx-pvc.yaml
Step 5: Create a Deployment for vLLM-hosted DeepSeek R1 model
Create a Kubernetes Deployment that serves the DeepSeek R1 model with vLLM. Copy the manifest below, replace <hugging-face-token> with your Hugging Face API token, and save it as deepseekr1-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: deepseek
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
    spec:
      containers:
      - name: deepseek
        image: vllm/vllm-openai:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<hugging-face-token>"
        args: ["--model", "deepseek-ai/DeepSeek-R1", "--port", "8000", "--trust-remote-code"]
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: cache-pvc
Apply the deployment:
kubectl apply -f deepseekr1-deployment.yaml
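On first start, vLLM downloads the DeepSeek R1 weights from Hugging Face into the cache volume, so the pod can take a while to become ready. You can watch progress with:

kubectl rollout status deployment/deepseek-r1 -n deepseek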
Step 6: Expose the vLLM Service
Create a Service to expose the vLLM deployment. Save it as deepseekr1-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-service
  namespace: deepseek
spec:
  selector:
    app: deepseek
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer # Change to ClusterIP or NodePort if needed
Apply the service:
kubectl apply -f deepseekr1-service.yaml
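Once your cloud provider assigns an external IP to the LoadBalancer, it appears in the service listing; note it for the next step:

kubectl get svc deepseek-r1-service -n deepseek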
Step 7: Verify the Deployment
1. Check the pods:
kubectl get pods -n deepseek
2. Verify the PVC is bound:
kubectl get pvc -n deepseek
3. Access the service:
- Use the external IP of the LoadBalancer service to access the vLLM endpoint.
- For example:
curl -X POST "http://<external-ip>/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Step 8: Monitor and Scale
1. Logs: Monitor logs to ensure the deployment works:
kubectl logs -f deployment/deepseek-r1 -n deepseek
2. Scaling: Update the replicas field in the deployment YAML to scale horizontally (see the example after this list).
3. Performance Tuning:
- Adjust resource requests/limits for CPU and memory in the deployment spec (a sketch follows this list).
- Use GPU nodes if required for faster inference.
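As a sketch of both knobs: the command below scales the Deployment to two replicas, and the fragment shows resource guarantees you might add to the container spec. The memory figure and GPU count are placeholders to adapt to your node types, and nvidia.com/gpu assumes the NVIDIA device plugin is installed on your GPU nodes:

kubectl scale deployment deepseek-r1 -n deepseek --replicas=2

        resources:
          requests:
            memory: "64Gi"        # placeholder; size to your model
            nvidia.com/gpu: "1"   # placeholder; requires the NVIDIA device plugin
          limits:
            memory: "64Gi"
            nvidia.com/gpu: "1"   # extended resources must have equal requests and limits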
Use Pure Storage FlashArray as a Direct Access Volume for Cache
To leverage the performance of Portworx FlashArray Direct Access volumes for caching large models, change the storage class configuration from Step 3 as follows:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: sc-portworx-fa-direct-access
provisioner: pxd.portworx.com
parameters:
  backend: "pure_block"
  max_iops: "1000"
  max_bandwidth: "1G"
allowVolumeExpansion: true
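Save the class (for example as portworx-fa-storage-class.yaml, a filename chosen here for illustration), apply it, and point the PVC from Step 4 at it by setting storageClassName: sc-portworx-fa-direct-access:

kubectl apply -f portworx-fa-storage-class.yaml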
This setup ensures your model uses vLLM for efficient serving and Portworx for caching, optimizing performance and storage reliability.
Get Started with Portworx
When it comes to AI/ML workflows like model inference, fast access to data is essential. Model caching with a leading Kubernetes storage and data management solution like Portworx is critical to accelerating these workflows. To learn more about Portworx, get started with a free trial, or reach out to us for a dedicated conversation on your AI/ML challenges today.