Contents
- Why Portworx for LLMs on Kubernetes
- Step-by-step guide to deploying DeepSeek with Portworx
In this article, we’ll explain why LLMs like DeepSeek R1 are often deployed and scaled on Kubernetes, and why Portworx is a strong choice for the Kubernetes storage layer beneath them. We’ll also walk through a step-by-step tutorial for deploying DeepSeek R1 on your Kubernetes cluster, using vLLM as the inference engine and Portworx for storage management.
Why deploy LLMs on Kubernetes
According to the 2024 Voice of Kubernetes Experts report, 54% of organizations running Kubernetes in production use it to support AI/ML workloads. Inference starts faster when model weights are cached rather than re-downloaded on every restart, and Portworx, as the leading container data management and Kubernetes storage solution, is a natural choice for accelerating these workloads.
Why run LLMs using Portworx Volumes
Using Portworx volumes for model caching offers several advantages, particularly when deploying applications like vLLM on Kubernetes. Here are the key benefits:
1. High Availability and Reliability
- Portworx ensures data redundancy and replication across nodes in the Kubernetes cluster, reducing the risk of data loss due to node or pod failures.
- Automatic failover mechanisms ensure uninterrupted access to the cache even during node failures.
2. Performance Optimization
- Low Latency: Portworx volumes provide high IOPS and low latency, which is crucial for caching LLMs to optimize inference speed.
- Locality Awareness: It intelligently manages data locality, reducing access time by serving cache data from the closest storage node.
3. Dynamic Scaling
- Portworx supports dynamic provisioning, enabling the cache volume to scale up or down based on model size or traffic demands.
- Elastic scalability ensures that your application performs well even under varying workloads.
4. Data Persistence
- Cached model data is persisted across pod restarts or rescheduling, preventing the need to reload models from scratch.
- Persistent volumes ensure faster recovery times and reduced initialization overhead after failures.
5. Multi-Model Support
- Portworx volumes can host multiple cached models simultaneously, making it suitable for multi-tenant or multi-model deployments where different models need to be accessed concurrently.
6. Kubernetes-Native Integration
- Portworx is designed to integrate seamlessly with Kubernetes, supporting features like Persistent Volumes (PVs), Persistent Volume Claims (PVCs), and Storage Classes.
- It simplifies storage management through declarative YAML configurations and Kubernetes-native tools.
7. Cost Efficiency
- With efficient caching on Portworx, frequently accessed data like model weights can remain in the cache, reducing expensive cloud storage or retrieval costs from S3 or other remote storage solutions.
- Fine-grained control over replication factors allows balancing between performance and storage costs.
8. Snapshot and Backup Capabilities
- Portworx enables taking snapshots of cached data, making it easy to restore or replicate models across environments (e.g., staging and production); see the example after this list.
- Backup and disaster recovery capabilities ensure that even cached data can be protected and restored.
9. Support for Hybrid and Multi-Cloud Deployments
- Portworx supports hybrid and multi-cloud environments, making it ideal for deployments where models and workloads span on-premises and cloud infrastructures.
- Consistent storage across environments simplifies deployment and management.
10. Advanced Security
- Portworx provides features like encryption at rest and in transit, ensuring that sensitive data, including cached model weights, remains secure.
- Role-based access control (RBAC) and integration with Kubernetes security mechanisms enhance protection.
11. Ease of Use
- With Portworx, administrators can dynamically manage storage with Kubernetes-native tools, avoiding the need for manual storage allocation or adjustments.
- Self-healing capabilities ensure minimal manual intervention in maintaining storage health.
12. Reduced Latency in Multi-Node Clusters
- For distributed workloads, Portworx enables caching closer to compute nodes, reducing latency in model inference pipelines.
- This is especially beneficial in environments where models are frequently queried, ensuring consistent and fast performance.
By using Portworx Volumes for caching, you can ensure a robust, high-performing, and cost-efficient storage solution that complements the speed and scalability needs of deploying large language models like those served with vLLM.
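To make the snapshot capability in point 8 concrete, here is a minimal sketch of a CSI VolumeSnapshot for the model-cache PVC created later in this tutorial. It assumes the Kubernetes snapshot CRDs are installed and that a Portworx VolumeSnapshotClass named px-csi-snapclass exists; the class name is an assumption, so check your cluster with kubectl get volumesnapshotclass:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cache-pvc-snap
  namespace: deepseek
spec:
  volumeSnapshotClassName: px-csi-snapclass # assumed class name; verify on your cluster
  source:
    persistentVolumeClaimName: cache-pvc

The resulting snapshot can then be referenced as a dataSource in a new PVC to clone the cache into another environment.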
Step-by-Step Guide
Here’s a step-by-step guide to deploying a DeepSeek R1 Hugging Face model using vLLM on a Kubernetes cluster with Portworx Volumes for caching:
Step 1: Prerequisites
- Kubernetes Cluster: Ensure you have a running Kubernetes cluster with GPUs; these can be provisioned as managed services from Google GKE, Azure AKS, or Amazon EKS.
- kubectl: Install and configure kubectl to interact with your cluster.
- Portworx: Install and configure Portworx as the storage solution in your Kubernetes cluster. If you’re not already using Portworx, you can get started with a free trial.
- Docker Image for vLLM: Create or use an available vLLM Docker image with the Hugging Face model and dependencies installed.
Step 2: Install Portworx in Kubernetes
- Follow the official Portworx documentation to install Portworx on your Kubernetes cluster. Ensure the cluster supports Persistent Volumes.
- Verify Portworx installation by running:
kubectl get storagecluster -A
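You can also inspect node-level health with the pxctl CLI from inside one of the Portworx pods. A minimal sketch, assuming the Portworx pods carry the label name=portworx and run in the kube-system namespace (adjust both to your install):

PX_POD=$(kubectl get pods -n kube-system -l name=portworx -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system $PX_POD -- /opt/pwx/bin/pxctl status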
Step 3: Create a Portworx Storage Class
Define a storage class that vLLM will use for caching. Create a YAML file (e.g., portworx-storage-class.yaml):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3" # Number of replicas
Apply the storage class:
kubectl apply -f portworx-storage-class.yaml
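Optionally, confirm the class is registered before moving on:

kubectl get storageclass portworx-sc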
Step 4: Create a Persistent Volume Claim (PVC)
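The manifests in this and the remaining steps live in a deepseek namespace, which the guide assumes exists. Create it once up front:

kubectl create namespace deepseek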
Create a PVC that uses the Portworx storage class for caching. Save it as portworx-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
  namespace: deepseek
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: portworx-sc

Size the storage request to the model you deploy: 50Gi is enough for smaller distilled DeepSeek R1 variants, but the full DeepSeek-R1 checkpoint is substantially larger.
Apply the PVC:
kubectl apply -f portworx-pvc.yaml
Step 5: Create a Deployment for vLLM-hosted DeepSeek R1 model
Create a Kubernetes Deployment that serves the DeepSeek R1 model with vLLM. Copy the manifest below, replace <hugging-face-token> with your Hugging Face API token, and save it as deepseekr1-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: deepseek
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
    spec:
      containers:
      - name: deepseek
        image: vllm/vllm-openai:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<hugging-face-token>"
        args: ["--model", "deepseek-ai/DeepSeek-R1", "--port", "8000", "--trust-remote-code"]
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: cache-pvc
Apply the deployment:
kubectl apply -f deepseekr1-deployment.yaml
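On first start, vLLM downloads the DeepSeek R1 weights from Hugging Face into the cache volume, so the pod can take a while to become ready. You can watch progress with:

kubectl rollout status deployment/deepseek-r1 -n deepseek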
Step 6: Expose the vLLM Service
Create a Service to expose the vLLM deployment. Save it as deepseekr1-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-service
  namespace: deepseek
spec:
  selector:
    app: deepseek
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer # Change to ClusterIP or NodePort if needed
Apply the service:
kubectl apply -f deepseekr1-service.yaml
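Once your cloud provider assigns an external IP to the LoadBalancer, it appears in the service listing; note it for the next step:

kubectl get svc deepseek-r1-service -n deepseek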
Step 7: Verify the Deployment
1. Check the pods:
kubectl get pods -n deepseek
2. Verify the PVC is bound:
kubectl get pvc -n deepseek
3. Access the service:
- Use the external IP of the LoadBalancer service to access the vLLM endpoint.
- For example:
curl -X POST "http://<external-ip>/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Step 8: Monitor and Scale
1. Logs: Monitor logs to ensure the deployment works:
kubectl logs -f deployment/deepseek-r1 -n deepseek
2. Scaling: Update the replicas field in the deployment YAML to scale horizontally (see the example after this list).
3. Performance Tuning:
- Adjust resource requests/limits for CPU and memory in the deployment spec (a sketch follows this list).
- Use GPU nodes if required for faster inference.
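As a sketch of both knobs: the command below scales the Deployment to two replicas, and the fragment shows resource guarantees you might add to the container spec. The memory figure and GPU count are placeholders to adapt to your node types, and nvidia.com/gpu assumes the NVIDIA device plugin is installed on your GPU nodes:

kubectl scale deployment deepseek-r1 -n deepseek --replicas=2

        resources:
          requests:
            memory: "64Gi"        # placeholder; size to your model
            nvidia.com/gpu: "1"   # placeholder; requires the NVIDIA device plugin
          limits:
            memory: "64Gi"
            nvidia.com/gpu: "1"   # extended resources must have equal requests and limits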
Use Pure Storage FlashArray as a Direct Access Volume for Cache
To leverage the performance of Portworx FlashArray Direct Access volumes for caching large models, change the storage class configuration from Step 3 as follows:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: sc-portworx-fa-direct-access
provisioner: pxd.portworx.com
parameters:
  backend: "pure_block"
  max_iops: "1000"
  max_bandwidth: "1G"
allowVolumeExpansion: true
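Save the class (for example as portworx-fa-storage-class.yaml, a filename chosen here for illustration), apply it, and point the PVC from Step 4 at it by setting storageClassName: sc-portworx-fa-direct-access:

kubectl apply -f portworx-fa-storage-class.yaml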
This setup ensures your model uses vLLM for efficient serving and Portworx for caching, optimizing performance and storage reliability.
Get Started with Portworx
When it comes to AI/ML workflows like model inference, fast access to data is essential. Model caching with a leading Kubernetes storage and data management solution like Portworx is critical to accelerating these workflows. To learn more about Portworx, get started with a free trial, or reach out to us for a dedicated conversation on your AI/ML challenges today.