March 23, 2017
Architect’s Corner: Nelson Kick, HPC Engineer at TGen
In today’s Architect’s Corner, we talk with Nelson Kick, manager of High Performance Computing at TGen, a leading genomics research institute that operates at petabytes of storage and 50 teraflops of compute. Nelson discusses how TGen uses containers in the context of High Performance Computing and some lessons learned running stateful services like MongoDB, etcd and Jenkins in containers. If you enjoy Nelson’s insights, you should know that we’ve previously highlighted TGen, one of Portworx’s earliest customers. If you want to learn more, you can read a case study featuring TGen CIO James Lowey. We’ve also included a video interview with James at the end of this post.
Key technologies discussed:
Container runtime – Docker
Datacenter – On-prem, AWS
Scheduler – Custom, Kubernetes
Stateful Services – MongoDB, etcd, Jenkins
What does TGen do?
TGen is a not-for-profit research company that does translational genomics. That’s a very broad term, so let me explain. Translational genomics research is relatively new, coming out of the Human Genome Project. Basically, we apply genomics to the development of diagnostics and therapies for cancer, neurological disorders, diabetes and other complex diseases.
Recently, we partnered with the City of Hope out of LA, which is an independent research and treatment center for cancer and diabetes. This gives TGen a clinical setting to advance the scientific discoveries we make.
Can you tell us a little about your role at TGen?
I’ve been at TGen for five years. I was initially hired to be a High Performance Computing Engineer for a new project that they were starting with Dell. It was a collaboration, a donation of a bunch of hardware and money from Dell, to work on a specific childhood disorder. That was roughly five years ago, and then two years ago I was promoted into the HPC manager job here at TGen for all the clusters that we have company-wide.
My basic responsibilities are day-to-day administration and overseeing the three other people in our HPC group, along with all the clusters, storage systems, networks, and applications related to scientific high-performance computing at TGen.
How are you using containers at TGen?
So a little background about our environment first. Each genome sequence is about 1 terabyte of data. So we’re petascale, meaning petabytes of storage with about 50 teraflops of compute across hundreds of nodes. That sounds like a lot, and it is, but believe it or not, we’re on the small side compared to most research centers and government. But we still have the IT infrastructure of a Fortune 100 company, between the storage capacity, the networking and the compute. We rival most larger companies just in our computational resources.
The primary reasons we’re using containers are portability and process isolation. For portability, we want to be able to take an application or a group of applications in what we call a pipeline, which is a workflow that does a specific genomic task, and run it on any node at TGen or at one of our research partners. This is important because in research you need to be able to confirm results, and that means you need to be able to run the workload multiple times, but not always on the same server.
Isolation is also particularly important. We don’t write 90% of the code that we run, so we have no real way to control what it does. It’s written either by a grad student or a researcher. It’s open source for the most part, so we have no way of knowing how it is going to perform. And we don’t have a lot of programmers here, so we’re not gonna start hacking the source. But if we can wrap it in a container, at least that way we control the resources so it doesn’t go crazy. Containers give us a way to take code of varying quality and get it to run somewhat efficiently.
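As an editorial illustration of the resource control Nelson describes (not TGen’s actual setup), here is a minimal Python sketch that builds a `docker run` command with CPU and memory caps. The `--cpus` and `--memory` flags are standard Docker options; the image name, the pipeline arguments, the limits, and the helper `docker_run_command` are all hypothetical.

```python
import shlex


def docker_run_command(image, args, cpus=2.0, memory="8g"):
    """Build a `docker run` command that caps CPU and memory for one pipeline step.

    The command is only constructed here, not executed.
    """
    cmd = [
        "docker", "run", "--rm",
        f"--cpus={cpus}",      # cap the CPU share available to the container
        f"--memory={memory}",  # hard memory limit for the container
        image,
    ]
    cmd.extend(args)
    return cmd


# Hypothetical image and arguments, for illustration only.
cmd = docker_run_command(
    "example/aligner:1.0", ["align", "sample.fastq"], cpus=4, memory="16g"
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` would launch the step with those limits, so a badly behaved tool cannot consume a whole node.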
We are also looking at using containers for bursting into the cloud. We’re testing a couple of different ways to do that: either directly with Kubernetes or with a third party, Cycle Computing, which is kind of like a gateway to the cloud where you present everything on your end, you play with their fancy GUI, and it pushes everything up to either Amazon or Google or whatever it may be.
What were some challenges you needed to overcome in order to run stateful services like databases, queues, key-value storage in containers?
Using existing infrastructure, separate SANs had to be set up to store the incoming genomic data. With Portworx, the same set of machines can be used to store and process genomic sequencing data. Today it costs us about a thousand bucks a terabyte. Using Portworx, containers, and more commodity-type hardware, we think we can reduce that by half, maybe more.
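To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses the numbers quoted in this interview (about 1 terabyte per genome, about $1,000 per terabyte today, and a hoped-for reduction of half); the count of 1,000 genomes is a hypothetical chosen only to make the arithmetic concrete, and the helper `storage_cost` is ours.

```python
def storage_cost(terabytes, cost_per_tb):
    """Return the total storage cost in dollars for a given capacity."""
    return terabytes * cost_per_tb


CURRENT_COST_PER_TB = 1000.0                  # "about a thousand bucks a terabyte"
TARGET_COST_PER_TB = CURRENT_COST_PER_TB / 2  # "reduce that by half"
TB_PER_GENOME = 1.0                           # "about 1 terabyte" per genome sequence
GENOMES = 1000                                # hypothetical count, for illustration

capacity_tb = GENOMES * TB_PER_GENOME
current = storage_cost(capacity_tb, CURRENT_COST_PER_TB)
target = storage_cost(capacity_tb, TARGET_COST_PER_TB)

print(f"Capacity: {capacity_tb:.0f} TB")
print(f"Current cost: ${current:,.0f}")
print(f"Target cost:  ${target:,.0f}")
print(f"Savings:      ${current - target:,.0f}")
```

Under these assumptions, halving the per-terabyte cost saves half of the total storage bill, and the saving grows linearly with the amount of sequencing data kept.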
Also, the near-bare-metal performance enables us to process and run our analysis workloads more quickly.
Going forward, we plan to implement Portworx in our genomic sequencing, which would allow us to reuse physical infrastructure. At that point, the container becomes the definition of the workload and we can quickly repurpose and elastically expand capacity.
What advice would you give someone considering running stateful containers in production?
My advice would be to know your data. You need to determine the characteristics of your data, such as how it interacts with your applications and environment, be it the software or the physical infrastructure that it is running on. Once you know the patterns and workflows of your data, you’ll be able to tailor the storage system for optimum results.
Want to learn more? Catch up on previous interviews from Architect’s Corner
Interview with TGen CIO, James Lowey.