NVIDIA’s infrastructure team manages large compute farms supporting chip design, simulation, and AI workloads. By implementing Portworx Enterprise and PX-Backup, NVIDIA improved resilience, simplified storage management, and enabled a self-service model for hundreds of applications across its Kubernetes environments.

The Challenge

NVIDIA’s infrastructure services team supports chip design, simulation, and AI workloads across four data centers and multiple cloud environments. With hundreds of containerized applications and massive compute demands, the team needed a way to manage persistent storage more effectively, reduce downtime, and provide developers with self-service access to infrastructure resources – all while maintaining enterprise-grade reliability.

The Solution

NVIDIA implemented Portworx Enterprise on its on-premises direct-attached storage (DAS) infrastructure to provide resilient, enterprise-grade storage for Kubernetes. Portworx was selected for its replication, fault tolerance, and operational simplicity. The team also deployed PX-Backup to protect persistent volumes and to use cloud snapshots for recovery.

The solution provides self-service storage classes for developers, automates storage management, and allows cluster updates with minimal downtime – critical for NVIDIA’s high-volume chip design, AI, and simulation workloads. NVIDIA also evaluated Portworx on Pure FlashArray for future expansion into managed storage environments.

The Results

With Portworx, NVIDIA now operates over 500 applications across hundreds of nodes. Portworx has helped the team improve reliability, availability, and manageability, achieving approximately 99.9% uptime, maintenance included, for these critical chip design workflows.

The platform supports advanced use cases such as multi-node Redis clusters, AI model training, and large-scale chip simulations. NVIDIA continues to expand globally, with new data centers planned to leverage Portworx for scalability and resilience.

“Development teams don’t need to think about what’s under the covers. Once it’s up and running we need availability … we have zero-downtime maintenance. Portworx allows us to shift workloads to different nodes without any impact on our users.”

Brian Monroe, Senior Software Engineer, NVIDIA