In today’s Architect’s Corner, we speak with Cris Fairweather, an Architect and Engineer at WCG Solutions. WCG Solutions is a diverse group composed primarily of engineering (civil, electrical, mechanical), environmental, and system lifecycle management personnel who have worked on maritime, communications, and general engineering projects spanning more than 30 years. WCG provides a range of engineering and systems management expertise and delivers practical engineering, which at its core means taking concepts from the drawing board to reality.
Key technologies discussed
Infrastructure – On-premises
Container runtime – Docker
Orchestration – Rancher
Stateful Services – MongoDB, PostgreSQL, MySQL, Atlassian products, GitLab, Grafana, InfluxDB
Tell us a little bit about the company that you work for?
WCG is a defense contractor based in San Diego. We provide a lot of services for our government customers, which include the Navy and the Department of Defense. I spend most of my time onsite with our customers, doing architecture planning, development, and maintenance. The current project I’m working on for a major military branch is building and maintaining a suite of web-based collaboration tools, both commercial products and in-house applications. These web apps are meant to facilitate communication across research and development organizations within the government. For instance, we are building an in-house version of Slack to communicate across teams, but we also deploy Atlassian tooling (Confluence and Jira) and GitLab for the development side.
Tell us a little bit about your role?
I’m a programmer by trade, but I also do Linux administration, and on many projects I play the role of Architect. Recently, I’ve found myself involved in the architectural process for all of our new deployments as well as updating our old deployments. Thankfully, the burden of responsibility for the total architecture is slightly reduced because I don’t have to deal with the network or hosting infrastructure, but everything else from the kernel to the front end is my responsibility right now.
Tell us a little bit about how you’re leveraging containers and microservices as a part of this communications architecture?
Containers provide much greater overall situational awareness of what you’re running in production, which is hugely important with large-scale systems like the ones used by the military.
One of the significant problems with pretty much any infrastructure I’ve come up against is the overwhelming technical debt of extremely poor configuration management. Poor configuration management leads to problems like not knowing which deployments are configured in which way, which makes updates very difficult, or forgetting that a critical cron job you didn’t know you relied on is running on some server somewhere.
One of the main benefits containers brought us is explicit documentation of every part of every service that we’re using. So we’re taking our old system with all that technical debt and putting it into concrete, defined, well-documented services that we can then easily deploy and manage. We’re using containers to achieve much better situational awareness for everything from web apps to their databases, to even deploying OpenLDAP, log aggregation systems, and Grafana and InfluxDB for our monitoring system.
We use containers in production for everything right now.
The result is that we’re not relying on configuration management to be done at the system level other than, “This host needs to run Docker.” So we use our container scheduling system to gather metrics and perform load balancing across all of our systems. No one should have to be aware of how a specific host is configured; it should be dumb, it should be reusable, it should be immutable, and our scheduling system (we currently use Rancher) should be the provider for all of the infrastructure services and applications deployed on the hosts.
Talk about some of the challenges that you faced when you started to deploy stateful services in containers?
Most of our services require persistence. On the application side, we’re running a MongoDB database as the store for our Slack-like communication tool. We also run several MySQL services as the databases for Atlassian applications like Jira and Confluence, and for our Container-as-a-service platform.
On the infrastructure side, we need persistence for OpenLDAP, for GitLab, which uses a PostgreSQL server, and for our InfluxDB server, where we store our statistics and metrics to preserve our historical data.
One of the challenges with running all this in containers is the overwhelming push to immutable infrastructure and the idea that no one host should be critical to our infrastructure, but you can’t get there when using local volumes.
We can try and run multiple containers on the same host using local volumes as the data backing for stateful containers, but then if that host goes down, you’re in big trouble. So we went on this year-long exploration to find some way of having an external data store for our container cluster that would meet the military-grade needs of the project.
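To make the local-volume problem concrete, here is a minimal sketch in Python. It is a toy model only, not Rancher's or any storage product's actual scheduling logic; the `Service` class, the `reschedule` function, and the host and service names are all hypothetical. It shows that a stateful service can only be restarted on a host that still holds a copy of its data.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    host: str           # host the container currently runs on (hypothetical names)
    volume_hosts: set   # hosts that hold a copy of the service's data


def reschedule(services, failed_host, healthy_hosts):
    """Return {service name: new host, or None if the data is unreachable}."""
    placement = {}
    for svc in services:
        if svc.host != failed_host:
            # Unaffected services stay where they are.
            placement[svc.name] = svc.host
            continue
        # A stateful container can only restart where its data lives.
        candidates = sorted(h for h in svc.volume_hosts if h in healthy_hosts)
        placement[svc.name] = candidates[0] if candidates else None
    return placement


services = [
    # Local volume: the data exists only on the host running the container.
    Service("mongodb-local", host="host-a", volume_hosts={"host-a"}),
    # Replicated volume: copies of the data exist on several hosts.
    Service("mongodb-replicated", host="host-a",
            volume_hosts={"host-a", "host-b", "host-c"}),
]

result = reschedule(services, failed_host="host-a",
                    healthy_hosts={"host-b", "host-c"})
print(result)
```

In this model, the service backed by a local volume has nowhere to go when `host-a` fails, while the one backed by a replicated volume can be restarted on a surviving host.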
One of the most popular solutions that we found was to run an NFS server. But that didn’t really do it for us because of security concerns with NFS. Out of the box, it doesn’t do the authentication that we needed beyond what Kerberos provides. Also, we don’t generally have direct access to any kind of SAN device or any of the major products you might find out there because we don’t have direct control over our data center. So we were really looking for a software-based solution that we could bring into our existing infrastructure to provide replicated and highly available Docker volumes. We also looked at a few software-defined solutions, including GlusterFS, but they really didn’t fit our needs.
After spending a year testing and researching a solution for persistent storage for containers, we found Portworx, which, in the locked-down, secure government environment that we find ourselves in, seems to be the shoo-in for allowing us to utilize the resources we have to get the outcome we want.
Plus, the support team is very responsive, and I really like little things like how Portworx maintains a Rancher Catalog that uses runC instead of Docker to minimize dependencies, which can cause problems in production.
Another challenge that I alluded to earlier is that the need for situational awareness is so much greater when using containers versus VMs. Because of the way services can be so easily deployed into a cluster, you need to know where they’re running, you need to know how they’re performing, you need to know how the hosts are performing in direct relation to your containers. And you need to be able to easily bring those services back up if they fail.
Those are all things that you generally should have when running VMs, but because containers let you deploy so many more services per host, you need much more automation to make sure that you don’t lose control of your environment.
Things that you could probably get away with not doing on VMs are essential on containers.
You really need good log aggregation and analysis, plus monitoring at both the host level and the container level. The difference is that you would be monitoring the applications directly if they were deployed directly on a host rather than in a container.
It’s not that the containers themselves require more logging, monitoring, and automation than VMs. You’re just deploying a lot more services per host, and you’re probably often scaling more services than you were before, so you’re having many more processes running. Generally, containers are much more ephemeral than VMs, so you have to be more prepared to deal with failure, which is especially important for stateful services.
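As a rough illustration of relating container-level metrics to host-level health, the sketch below rolls per-container CPU samples up to their hosts and flags any host over a threshold, along with its busiest container. It is a sketch under assumptions: the `host_pressure` function, the sample values, and the 0.8 threshold are hypothetical, and a real deployment would pull these numbers from its monitoring stack rather than a hard-coded list.

```python
from collections import defaultdict


def host_pressure(samples, cpu_limit=0.8):
    """Roll container CPU samples up to host level.

    samples: iterable of (host, container, cpu_fraction) tuples.
    Returns {host: (total_cpu, busiest_container)} for hosts over cpu_limit.
    """
    per_host = defaultdict(list)
    for host, container, cpu in samples:
        per_host[host].append((container, cpu))

    report = {}
    for host, entries in per_host.items():
        total = sum(cpu for _, cpu in entries)
        if total > cpu_limit:
            # Name the container contributing the most load on this host.
            busiest = max(entries, key=lambda entry: entry[1])[0]
            report[host] = (round(total, 2), busiest)
    return report


# Hypothetical samples: (host, container, fraction of host CPU in use).
samples = [
    ("host-a", "jira", 0.35),
    ("host-a", "confluence", 0.30),
    ("host-a", "gitlab", 0.25),
    ("host-b", "grafana", 0.10),
    ("host-b", "influxdb", 0.20),
]

overloaded = host_pressure(samples)
print(overloaded)
```

No single container on `host-a` looks alarming on its own, but together they push the host past the threshold, which is why host-level and container-level views need to be read side by side.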
Knowing what you know now, what advice would you give someone else in your position who needs to deploy stateful containers in production?
Stop approaching the problem of stateful containers as if you were still using VMs. Stop pretending that you can do your application architecture exactly the same way you did before. That is just kicking the can down the road.
You need to loosely couple every component of your application and stop creating interdependencies between your services if you want to be successful with stateful containers.
And quite frankly, it’s easier to manage in the long run if you start decoupling those components. That seems to be the biggest mistake I see people making these days: approaching the problem with the same mindset they have with traditional applications.