Continuous Deployment of On-Prem Kubernetes Infrastructure
by Joe Salisbury on Jul 10, 2017
With the creation of our Giantnetes stack, the team at Giant Swarm decided that we should continuously deploy our entire fancy new infrastructure. CI/CD is good, we should do more of it.
This post describes the first iteration of our deployment system, the challenges we faced with it, and our second iteration ‘pull-based’ deployment system.
Giant Swarm CI/CD Generation 1
All of our microservices had a directory in their repository containing their Kubernetes resources: Deployment, Service, and the like. These files contained placeholders for values we needed to change as part of the deployment.
When a Pull Request was merged into master, we would automatically start a deployment as part of our CI/CD process. Docker images would be, and still are, tagged with the commit SHA hash.
We used sed (high-tech, eh) to update the placeholders in the resources. For example, in the case of the aws-operator, the field %%DOCKER_TAG%% in the Deployment would be replaced with quay.io/giantswarm/aws-operator:001f672aff35a470f4c9d59c07fb3df86ed3e028. This also applied to some other fields, such as host names in Ingress resources.
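To make the substitution step concrete, here is a minimal sketch of the same idea in Python rather than sed. The %%DOCKER_TAG%% placeholder format is from the post; the function name render and the choice to fail on an unknown placeholder are assumptions made for illustration, not how the original scripts behaved.

```python
import re

# Matches placeholders written as %%NAME%%, the format described in the post.
PLACEHOLDER = re.compile(r"%%([A-Z0-9_]+)%%")


def render(template, values):
    """Replace every %%NAME%% in template with values[NAME].

    Raising on an unknown placeholder is an assumption for this sketch;
    a plain sed substitution would silently leave it in place.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value for placeholder %%" + name + "%%")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)
```

Calling render("image: %%DOCKER_TAG%%", {"DOCKER_TAG": "quay.io/giantswarm/aws-operator:001f672aff35a470f4c9d59c07fb3df86ed3e028"}) would produce the image line with the commit-tagged reference filled in.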
(We use Quay for our registry nowadays. It’s great. Would recommend.)
We would then use kubectl apply against a set of pre-configured installations, all of which had Internet-accessible Kubernetes host cluster APIs. This would update the Kubernetes resources and bring the new services online for all of our clusters.
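The push step can be sketched roughly as follows. This assumes each installation is reachable through a named kubectl context, which the post does not state; the context names and the helper names are hypothetical.

```python
import subprocess


def kubectl_apply_commands(contexts, resource_dir):
    """Build one 'kubectl apply' command per installation.

    Assumes (hypothetically) that each installation has a kubectl
    context of the same name configured on the build server.
    """
    return [
        ["kubectl", "--context", context, "apply", "-f", resource_dir]
        for context in contexts
    ]


def push_to_all(contexts, resource_dir):
    """Run the commands in order, stopping at the first failure."""
    for command in kubectl_apply_commands(contexts, resource_dir):
        subprocess.run(command, check=True)
```

The important property is that the build server opens the connection to every cluster, which is exactly what stops working once a cluster's API is not reachable from outside.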
This system let us get started with CI/CD for our infrastructure, even if it was a little hacky in places. (I secretly loved it.)
Turns out not everything is on the Internet
This ‘push-based’ approach, while fairly stable and quick to put together, does not work when the Kubernetes host cluster API is not accessible from the build server. This is the case for a number of on-premises clusters. The new requirement to deploy to host clusters that are not reachable from the Internet is what spurred this larger update to our deployment system.
We have largely phased out the ‘push-based’ approach now, excluding some of our internal services. We are aiming to migrate them shortly.
For the curious, we use CircleCI for our CI/CD platform. This is 100% because they said they’d send the mob after me if we didn’t.
Problem Definition
In brief, the requirements of the problem we now need to solve are:
- New versions of our services should be available as quickly as possible. We currently have more than 20 microservices, and this is growing.
- Services need to be deployed to a large, and growing, number of installations. We should be able to support around 100 installations comfortably.
And the new requirement of:
- Being able to support deployments to clusters not accessible via the Internet.
To be specific, this refers to the host cluster’s Kubernetes API not being available via the Internet. This requirement comes from our customers, most often for security and compliance. Our deployment system assumes that the host cluster is already set up, and makes heavy use of Kubernetes tooling.
High-Level
If we can’t push updates to clusters, can clusters pull updates from us?
The high-level idea is to introduce an agent that runs on the host cluster, periodically checks for new versions, and installs them into the host cluster.
This agent is our new open-source service, draughtsman.
Pull Based Deployments
Each service now has a Helm Chart instead of the custom resource directory. We started using Helm Charts as a format because we needed a package for storing our Kubernetes resources. It makes much more sense to us to make use of tooling already available in the community, as opposed to rolling our own.
When we deploy, we push the Helm Chart (as well as the freshly built Docker image) to Quay. We use their fancy new Application Registry support (see their announcement) to store our Charts. This allows us to push Charts as part of the deployment, and pull them later.
The build then creates GitHub Deployments for each configured cluster. With these GitHub Deployments we use two particular keys: environment and ref. environment refers to the Giantnetes installation, and ref refers to the commit SHA hash of the project.
For example, if the aws-operator repository has a GitHub Deployment with the environment jabberwocky and the ref 001f672aff35a470f4c9d59c07fb3df86ed3e028, then the aws-operator chart at that version should be deployed to the jabberwocky installation.
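As a rough illustration of what the build has to send, the sketch below prepares one request per installation for the GitHub REST API's create-deployment endpoint. The environment and ref keys are the ones described above; setting auto_merge to False and required_contexts to an empty list are assumptions for this sketch, and the helper name is hypothetical. Nothing is sent over the network here.

```python
def deployment_requests(owner, repo, sha, installations):
    """Return (url, payload) pairs, one GitHub Deployment per installation.

    'ref' and 'environment' are the two keys the post relies on.
    'auto_merge' and 'required_contexts' are real API parameters, but the
    values chosen here are assumptions, not the author's configuration.
    """
    url = "https://api.github.com/repos/{}/{}/deployments".format(owner, repo)
    return [
        (
            url,
            {
                "ref": sha,
                "environment": installation,
                "auto_merge": False,
                "required_contexts": [],
            },
        )
        for installation in installations
    ]
```

Each payload would then be sent as an authenticated POST with a JSON body. The build only ever talks to GitHub, never to the clusters themselves.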
Draughtsman
draughtsman runs as a Pod inside each of our Giantnetes installations, and is configured with the name of the installation (jabberwocky, from the above example). It periodically polls the GitHub API and checks if there are new GitHub Deployments for each of its configured projects.
When a new GitHub Deployment is created, draughtsman pulls the Helm Chart from Quay and installs it into the host cluster. draughtsman uses the Helm client internally, which talks to Tiller.
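The polling behaviour described above can be sketched as a small loop. This is not draughtsman's actual code (which is a separate open-source service); it only illustrates the pull model. The fetch and install callables, the in-memory set of seen ids, and the 60-second default interval are all assumptions. The id, sha, and environment fields are those returned by the GitHub Deployments API.

```python
import time


def new_deployments(deployments, environment, seen_ids):
    """Pick deployments for this installation that were not handled yet."""
    fresh = [
        d for d in deployments
        if d["environment"] == environment and d["id"] not in seen_ids
    ]
    # Handle older deployments first so the newest one is installed last.
    return sorted(fresh, key=lambda d: d["id"])


def poll(fetch, install, environment, interval_seconds=60, max_rounds=None):
    """Repeatedly fetch deployments and install the ones that are new.

    fetch() returns a list of deployment dicts; install(sha) installs the
    chart at that commit. Both are hypothetical hooks for this sketch.
    """
    seen_ids = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for deployment in new_deployments(fetch(), environment, seen_ids):
            install(deployment["sha"])
            seen_ids.add(deployment["id"])
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)
```

Because the agent only makes outbound requests, the host cluster's API never needs to be exposed to the build server.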
Configuration Management
When designing the second iteration (with the goal of being able to support around 100 installations), it quickly became clear that configuration management would eat us all.
The configuration of all services is now held in a single ConfigMap, which draughtsman reads when installing a Helm Chart and uses to form the ‘values file’. Helm then uses this for templating. We are working on further formalizing this configuration management.
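A minimal sketch of that flow, under stated assumptions: the ConfigMap's data is taken to hold the values YAML as a string under a key named values (a hypothetical key name), and the chart is installed with helm upgrade --install and --values. The post does not say which Helm command draughtsman actually runs, so treat the command as illustrative.

```python
import os
import tempfile


def write_values_file(configmap_data, key="values"):
    """Write the values YAML stored in the ConfigMap to a temporary file.

    The key name 'values' is an assumption made for this sketch.
    """
    fd, path = tempfile.mkstemp(suffix=".yaml")
    with os.fdopen(fd, "w") as handle:
        handle.write(configmap_data[key])
    return path


def helm_command(release, chart, values_path):
    """Build a Helm command that installs or upgrades with the values file."""
    return ["helm", "upgrade", "--install", release, chart, "--values", values_path]
```

Keeping a single source of configuration per installation means a chart can stay identical across installations, with only the values file differing.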
Open Problems and Future Development
Feature branch deployment is a feature that the team requested, which isn’t currently supported very well. The rough idea is to build and push images and Helm Charts for every commit, and provide a method for deploying these to test clusters by creating a GitHub Deployment for the test build.
Our build tool currently has a list of every installation and which projects should be installed there, to create the GitHub Deployments. While not bad, this feels slightly wrong. I would like to move this out into another system. There is also room here for enabling further integration testing before deployment.
Our current goal was to comfortably support around 100 installations with this system. When we have more than 100, we’re most likely going to look at how we support 1000 installations. Coming up!
Conclusion
In a nutshell, this is our current deployment system, and we’re aiming to open-source as much of it as possible. It’s cool.